mbox series

[RFC,0/9] pci: portdrv: Add auxiliary bus and register CXL PMUs (and aer)

Message ID 20240529164103.31671-1-Jonathan.Cameron@huawei.com
Headers show
Series pci: portdrv: Add auxiliary bus and register CXL PMUs (and aer) | expand

Message

Jonathan Cameron May 29, 2024, 4:40 p.m. UTC
Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports +
Switch USP and DSPs.

The final patch, moving AER to the aux bus, is in the category of 'why
not try this' rather than something I feel is a particularly good idea.
I would like to get opinions on the various ways to avoid the double-bus
situation introduced here. There are some comments on other options for the
existing 'pci_express' bus devices at the end of this cover letter.

The primary problem this RFC tries to solve is providing a means to
register the CXL PMUs found in devices which will be bound to the
PCIe portdrv. The proposed solution is to avoid the limitations of
the existing pcie service driver approach by bolting on support
for devices registered on the auxiliary_bus.

Background
==========

When the CXL PMU driver was added, only the instances found in CXL type 3
memory devices (endpoints) were supported. These PMUs also occur in CXL
root ports and in CXL switch upstream and downstream ports.  Adding
support for these was deliberately left for future work because these
ports are all bound to the PCIe portdrv via the appropriate PCI class
code.  Whilst some CXL functionality on RP and Switch Ports
is handled by the CXL port driver, that driver is not bound to the PCIe device,
is only instantiated as part of enumerating endpoints, and cannot use
interrupts because those are under the control of the PCIe portdrv. A PMU
without interrupts would be unfortunate, so a different solution is needed.

Requirements
============

- Register multiple CXL PMUs (CPMUs) per portdrv instance.
- portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which
  covers any PCI-Express port.
- PCI MSI / MSI-X message-based interrupts must be registered by a single
  driver - in this case it's the portdrv.
- portdrv goes through a dance to discover the minimum number of irq vectors
  required, and as part of this it needs to find out the highest vector number
  used.
- CXL PMUs store the interrupt message number (needed to work out the
  number of vectors to register) in a register in a BAR.  The BAR plus
  offset of the register block is discovered from the CXL
  Register Locator DVSEC.  As such, finding the maximum interrupt message
  number used by any CPMU requires checking all CXL PMU register blocks
  in the CXL Register Locator DVSEC, briefly mapping their register space
  and checking that one register (a sketch of this step follows the list).
  This is safe to do because the rest of the CXL stack doesn't map this
  section of the BAR and the CXL PMU device is not added until after we
  have unmapped it again.
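
A minimal sketch of that discovery step (purely illustrative:
cxl_pmu_regblock_next(), struct cxl_pmu_regblock, CPMU_CAP_REG and
CPMU_CAP_MSGNUM_MASK are made-up names standing in for whatever the real
patches use; only pci_iomap_range()/pci_iounmap() are existing kernel APIs):

static int cxlpmu_max_msgnum(struct pci_dev *pdev)
{
        struct cxl_pmu_regblock rb;     /* hypothetical: BAR index + offset from the DVSEC */
        int idx = 0, max_msgnum = -1;

        /* Walk every CPMU register block listed in the Register Locator DVSEC */
        while (cxl_pmu_regblock_next(pdev, idx++, &rb) == 0) {
                void __iomem *base;
                u64 cap;

                /* Briefly map just this register block */
                base = pci_iomap_range(pdev, rb.bar, rb.offset, SZ_4K);
                if (!base)
                        return -ENOMEM;

                cap = readq(base + CPMU_CAP_REG);       /* hypothetical register offset */
                max_msgnum = max(max_msgnum,
                                 (int)FIELD_GET(CPMU_CAP_MSGNUM_MASK, cap));

                /* Unmapped again before the CPMU device is ever registered */
                pci_iounmap(pdev, base);
        }
        return max_msgnum;
}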

Dependency fun.
- The existing CXL_PMU driver registers the device on the CXL_BUS.
- portdrv is not modular.
- CONFIG_CXL_BUS is potentially modular (i.e. not available until the
  core CXL module is loaded).
- The portdrv can't be dependent on CONFIG_CXL_BUS directly without
  forcing CXL to be built in.

There are various ways to solve this dependency issue and meet the above
requirements.

1. Add an intermediate glue device with a driver that can depend on
   CONFIG_CXL_BUS and hence ensure that is loaded before doing a full
   enumeration of the CXL PMU instances and registering appropriate
   devices. To allow for simple tear down, device-managed interfaces are
   used against the platform driver, so that when it goes away it removes
   the CPMU device instances cleanly. The parents are set one level higher,
   however, so that path can be used to get to the underlying PCI device
   (which can't go away without tearing everything down as it's the
    grandparent device).

                                                             On CXL BUS    
 _____________ __            ________________ __           ________ __________
|  Port       |  | creates  |                |  | creates |        |          |
|  PCI Dev    |  |--------->| PCIE CPMU GLUE |  |-------->| CPMU A |          |
|_____________|  |          | Platform Device|  |         |________|          |
|portdrv binds   |          |________________|  |         | perf/cxlpmu binds |
|________________|          |cxlpmu_glue binds  |         |___________________|
                            | enumerates CPMUs  |          ________ __________
                      Deps /|___________________|-------->|        |          |
                          /                               | CPMU B |          |
                    cxl_core.ko                           |________|          |
		    /  providing CXL_BUS                  | perf/cxlpmu binds | 
               Deps/                                      |___________________|
 _____________ __ /
|  Type 3     |  | creates                                 ________ _________
|  PCI Dev    |  |--------------------------------------->|        |         |
|_____________|  |                                        | CPMU C |         |
| cxl_pci binds  |                                        |________|         |
|________________|                                        | perf/cxlpmu binds|
                                                          |__________________|

2. Make the CXL_PMU driver handle multiple buses. This will look similar
   to a sensor driver with both I2C and SPI drivers, but in this case
   registering a driver for devices on the CXL_BUS and one for another
   'new' bus (e.g. the auxiliary bus, which exists to enable this sort of
   hybrid driver). Register auxiliary devices directly from portdrv. This
   has to be done in multiple steps so that enough interrupt vectors are
   allocated before the CPMU device is added and that driver can probe.

                                On auxiliary bus, children of the portdrv.
 _____________ __            ________ __________
|  Port       |  | creates  |        |          |
|  PCI Dev    |  |--------->| CPMU A |          |
|_____________|  |          |________|          |
|portdrv binds   |          | perf/cxlpmu binds |
|________________|          |___________________|
                 \         
                  \
                   \         ________ __________
                    \       |        |          |
                     ------>| CPMU B |          |
                            |________|          |
                            | perf/cxlpmu binds |
                            |___________________|
 AND
 
                     cxl_core.ko           
		    /  providing CXL_BUS   
               Deps/                                         On CXL BUS
 _____________ __ /
|  Type 3     |  | creates                                 ________ _________
|  PCI Dev    |  |--------------------------------------->|        |         |
|_____________|  |                                        | CPMU C |         |
| cxl_pci binds  |                                        |________|         |
|________________|                                        | perf/cxlpmu binds|
                                                          |__________________|                  

I have code for both, and on balance option 2 is cleaner as no magic glue
devices are registered, so that's what is presented in this RFC. A rough
sketch of the auxiliary-device plumbing is below.
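
To make option 2 concrete, here is roughly the shape of the plumbing it
implies. This is illustrative only, not lifted from the patches: the names
(cxl_cpmu_adev, portdrv_add_cpmu(), the "pcie_portdrv.cpmu" match string) are
invented, cpmu_adev_release() and cpmu_aux_probe() are assumed to exist
elsewhere, and error handling is trimmed. portdrv creates one auxiliary_device
per discovered CPMU register block; the perf driver then registers an
auxiliary_driver alongside its existing CXL bus driver.

struct cxl_cpmu_adev {
        struct auxiliary_device adev;
        void __iomem *base;     /* or BAR + offset, whatever portdrv hands over */
        int irq;
};

/* portdrv side: called once per CPMU register block found via the DVSEC */
static int portdrv_add_cpmu(struct pci_dev *pdev, int id)
{
        struct cxl_cpmu_adev *cpmu;
        int rc;

        cpmu = kzalloc(sizeof(*cpmu), GFP_KERNEL);
        if (!cpmu)
                return -ENOMEM;

        cpmu->adev.name = "cpmu";
        cpmu->adev.id = id;
        cpmu->adev.dev.parent = &pdev->dev;
        cpmu->adev.dev.release = cpmu_adev_release;     /* assumed: frees the wrapper */

        rc = auxiliary_device_init(&cpmu->adev);
        if (rc) {
                kfree(cpmu);
                return rc;
        }
        rc = auxiliary_device_add(&cpmu->adev);
        if (rc)
                auxiliary_device_uninit(&cpmu->adev);
        return rc;
}

/* perf driver side: second registration path next to the existing cxl_driver */
static const struct auxiliary_device_id cpmu_aux_ids[] = {
        { .name = "pcie_portdrv.cpmu" },        /* "<modname>.<name>" match, illustrative */
        { }
};
MODULE_DEVICE_TABLE(auxiliary, cpmu_aux_ids);

static struct auxiliary_driver cpmu_aux_driver = {
        .probe = cpmu_aux_probe,        /* assumed: shares setup with the CXL bus probe path */
        .id_table = cpmu_aux_ids,
};
module_auxiliary_driver(cpmu_aux_driver);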

AER experiment
==============

Note that this discussion, oddly enough, doesn't apply to the CXL
PMU, because we need a bus structure of some type to solve the
registration dependency problems.  The existing portdrv subdrivers
(or at least those I've looked at in detail) are much simpler.

Having two different types of bus under a portdrv instance is far
from ideal. Given that I'm adding an auxbus, wanted to poke some of
the corners, and have suitable CXL error injection support in QEMU
that lets me exercise the AER flows, I decided to play a bit with that.
Note that there is some power management support in this patch set;
I hacked in callbacks to the AER driver to test it, but it is
infrastructure that AER itself doesn't need.

In previous discussions various approaches have been proposed.

1) Auxiliary bus as done here.  It works, but is messy and there
   is relatively little point if the AER driver is always forced
   to be built in and loaded immediately anyway. Is there any interest
   in making the service drivers modular?

2) Turning the service drivers into straightforward library
   function calls (a rough sketch follows this list).  This basically
   means providing some storage for their state and calling the bits
   of probe and remove not related to the struct device (which will
   no longer exist) directly.
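
For illustration, option 2 might end up with portdrv (or whatever replaces
it) calling each service directly, roughly along these lines. The function
names are invented here purely to show the shape (a later reply in this
thread floats a similar aer_present()/aer_initialize() style API); none of
them exist today, and struct port_state is a placeholder.

static int port_setup_services(struct pci_dev *pdev, struct port_state *st)
{
        int rc;

        if (aer_present(pdev)) {                        /* replaces the AER service device */
                st->aer_irq = port_service_irq(pdev, aer_irq_message_num(pdev));
                rc = aer_initialize(pdev, st->aer_irq); /* old probe, minus the struct device */
                if (rc)
                        return rc;
                st->aer_active = true;
        }
        /* ... same pattern for PME, DPC, hotplug, bandwidth notification ... */
        return 0;
}

static void port_teardown_services(struct pci_dev *pdev, struct port_state *st)
{
        if (st->aer_active)
                aer_finalize(pdev);                     /* old remove */
}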

I'm happy to look at either, but keen that, if possible, we don't
gate the CPMU support on that.  There have been discussions around
the need for vendor-specific telemetry solutions on CXL switches,
and to try and encourage the upstream approach, I'd like to ensure
we have good support for the facilities in the CXL specification.

Side question.  Does anyone care if /sys/bus/pci_express goes away?
- in theory it's ABI, but in practice it provides very little
  actual ABI.

If we do need to maintain that, refactoring this code gets much
harder and I suspect our only way forward is a cut-and-paste
version of the auxiliary_bus that gives the flexibility needed
but still provides that /sys/bus/pci_express.
Not nice, so best avoided if we can!

Other misc questions / comments.
- Can we duplicate the irq discovery into the auxiliary_bus drivers?
  The irqs do need to be discovered in portdrv to work out how many
  vectors to request, but the current multipass approach of
  (1. Discovery, 2. IRQ pass one, 3. Store the IRQs) means
  auxiliary devices can only be created in an additional pass 4,
  unless we don't pass them the irqs and instead allow the child device
  drivers to query them directly when they load.  Note this also
  avoids the theoretical possibility that they move when portdrv
  reduces the requested vectors (to avoid waste).
- Additional callbacks are needed in auxiliary_driver for runtime PM.
  For now I've just done suspend and resume, but we need the
  noirq variants and runtime_suspend / runtime_resume callbacks
  to support all the existing pcie_driver-handled subdevices.
- Is it worth moving any of the drivers that register in parallel
  with the PCIe portdrv over to this new approach (the DesignWare
  PMU, for example)?
  
Jonathan Cameron (9):
  pci: pcie: Drop priv_data from struct pcie_device and use
    dev_get/set_drvdata() instead.
  pci: portdrv: Drop driver field for port type.
  pci: pcie: portdrv: Use managed device handling to simplify error and
    remove flows.
  auxiliary_bus: expose auxiliary_bus_type
  pci: pcie: portdrv: Add a auxiliary_bus
  cxl: Move CPMU register definitions to header
  pci: pcie/cxl: Register an auxiliary device for each CPMU instance
  perf: cxl: Make the cpmu driver also work with auxiliary_devices
  pci: pcie: portdrv: aer: Switch to auxiliary_bus

 drivers/base/auxiliary.c          |   2 +-
 drivers/cxl/pmu.h                 |  54 +++++++
 drivers/pci/hotplug/pciehp_core.c |  13 +-
 drivers/pci/hotplug/pciehp_hpc.c  |   2 +-
 drivers/pci/pci-driver.c          |   4 -
 drivers/pci/pcie/Kconfig          |  10 ++
 drivers/pci/pcie/Makefile         |   1 +
 drivers/pci/pcie/aer.c            |  68 +++++---
 drivers/pci/pcie/cxlpmu.c         | 129 +++++++++++++++
 drivers/pci/pcie/dpc.c            |   1 -
 drivers/pci/pcie/pme.c            |  14 +-
 drivers/pci/pcie/portdrv.c        | 254 +++++++++++++++++++++++-------
 drivers/pci/pcie/portdrv.h        |  37 +++--
 drivers/perf/Kconfig              |   1 +
 drivers/perf/cxl_pmu.c            | 153 ++++++++++--------
 include/linux/auxiliary_bus.h     |   1 +
 16 files changed, 558 insertions(+), 186 deletions(-)
 create mode 100644 drivers/pci/pcie/cxlpmu.c

Comments

Bjorn Helgaas June 5, 2024, 6:04 p.m. UTC | #1
On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote:
> Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports +
> Switch USP and DSPs.
> 
> The final patch moving AER to the aux bus is in the category of 'why
> not try this' rather than something I feel is a particularly good idea.
> I would like to get opinions on the various ways to avoid the double bus
> situation introduced here. Some comments on other options for those existing
> 'pci_express' bus devices at the end of this cover letter.
> 
> The primary problem this RFC tries to solve is providing a means to
> register the CXL PMUs found in devices which will be bound to the
> PCIe portdrv. The proposed solution is to avoid the limitations of
> the existing pcie service driver approach by bolting on support
> for devices registered on the auxiliary_bus.
> 
> Background
> ==========
> 
> When the CXL PMU driver was added, only the instances found in CXL type 3
> memory devices (end points) were supported. These PMUs also occur in CXL
> root ports, and CXL switch upstream and downstream ports.  Adding
> support for these was deliberately left for future work because theses
> ports are all bound to the pcie portdrv via the appropriate PCI class
> code.  Whilst some CXL support of functionality on RP and Switch Ports
> is handled by the CXL port driver, that is not bound to the PCIe device,
> is only instantiated as part of enumerating endpoints, and cannot use
> interrupts because those are in control of the PCIe portdrv. A PMU with
> out interrupts would be unfortunate so a different solution is needed.
> 
> Requirements
> ============
> 
> - Register multiple CXL PMUs (CPMUs) per portdrv instance.

What resources do CPMUs use?  BARs?  Config space registers in PCI
Capabilities?  Interrupts?  Are any of these resources
device-specific, or are they all defined by the CXL specs?

The "device" is a nice way to package those up and manage ownership
since there are often interdependencies.

I wish portdrv was directly implemented as part of the PCI core,
without binding to the device as a "driver".  We handle some other
pieces of functionality that way, e.g., the PCIe Capability itself,
PM, VPD, MSI/MSI-X, 

> - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which
>   covers any PCI-Express port.
> - PCI MSI / MSIX message based interrupts must be registered by one driver -
>   in this case it's the portdrv.

The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X
is a really annoying part of PCIe because bridges may have a BAR or
two and are allowed to have device-specific endpoint-like behavior, so
we may have to figure out how to share those interrupts between the
PCIe-architected stuff and the device-specific stuff.  I guess that's
messy no matter whether portdrv is its own separate driver or it's
integrated into the core.

> Side question.  Does anyone care if /sys/bus/pci_express goes away?
> - in theory it's ABI, but in practice it provides very little
>   actual ABI.

I would *love* to get rid of /sys/bus/pci_express, but I don't know
what, if anything, would break if we removed it.

Bjorn
Jonathan Cameron June 5, 2024, 7:44 p.m. UTC | #2
On Wed, 5 Jun 2024 13:04:09 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote:
> > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports +
> > Switch USP and DSPs.
> > 
> > The final patch moving AER to the aux bus is in the category of 'why
> > not try this' rather than something I feel is a particularly good idea.
> > I would like to get opinions on the various ways to avoid the double bus
> > situation introduced here. Some comments on other options for those existing
> > 'pci_express' bus devices at the end of this cover letter.
> > 
> > The primary problem this RFC tries to solve is providing a means to
> > register the CXL PMUs found in devices which will be bound to the
> > PCIe portdrv. The proposed solution is to avoid the limitations of
> > the existing pcie service driver approach by bolting on support
> > for devices registered on the auxiliary_bus.
> > 
> > Background
> > ==========
> > 
> > When the CXL PMU driver was added, only the instances found in CXL type 3
> > memory devices (end points) were supported. These PMUs also occur in CXL
> > root ports, and CXL switch upstream and downstream ports.  Adding
> > support for these was deliberately left for future work because theses
> > ports are all bound to the pcie portdrv via the appropriate PCI class
> > code.  Whilst some CXL support of functionality on RP and Switch Ports
> > is handled by the CXL port driver, that is not bound to the PCIe device,
> > is only instantiated as part of enumerating endpoints, and cannot use
> > interrupts because those are in control of the PCIe portdrv. A PMU with
> > out interrupts would be unfortunate so a different solution is needed.
> > 
> > Requirements
> > ============
> > 
> > - Register multiple CXL PMUs (CPMUs) per portdrv instance.  
> 
> What resources do CPMUs use?  BARs?  Config space registers in PCI
> Capabilities?  Interrupts?  Are any of these resources
> device-specific, or are they all defined by the CXL specs?

It uses the whole lot (there isn't a toy out there that the CXL
spec doesn't like to play with :). It's discoverable via a CXL-defined DVSEC
in config space. Registers are in a BAR (discovered from the DVSEC), and
MSI/MSI-X is used, with the message number in a register in the BAR.

All the basic infrastructure and some performance event definitions
are in the CXL spec.  It's all discoverable. 

The events are even tagged with VID so a vendor can implement
someone else's definitions or those coming from another specification.
A bunch of CXL VID tagged ones exist for protocol events etc.

It's a really nice general spec.  If I were more of a dreamer and had
the time I'd try to get PCI-SIG to adopt it and we'd finally have
something generic for PCI performance counting.

> 
> The "device" is a nice way to package those up and manage ownership
> since there are often interdependencies.
> 
> I wish portdrv was directly implemented as part of the PCI core,
> without binding to the device as a "driver".  We handle some other
> pieces of functionality that way, e.g., the PCIe Capability itself,
> PM, VPD, MSI/MSI-X, 

Understood, though I would guess that even then there needs to
be a pci_driver binding to the class code to actually query all those
facilities and get interrupts registered etc.

> 
> > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which
> >   covers any PCI-Express port.
> > - PCI MSI / MSIX message based interrupts must be registered by one driver -
> >   in this case it's the portdrv.  
> 
> The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X
> is a really annoying part of PCIe because bridges may have a BAR or
> two and are allowed to have device-specific endpoint-like behavior, so
> we may have to figure out how to share those interrupts between the
> PCIe-architected stuff and the device-specific stuff.  I guess that's
> messy no matter whether portdrv is its own separate driver or it's
> integrated into the core.

Yes, the interrupt stuff is the tricky bit.  I think whatever happens
we end up with a pci device driver that binds to the class code.
Maybe we have a way to switch in alternative drivers if that turns
out to be necessary.

Trying to actually manage the interrupts in the core (without a driver binding)
is really tricky.  Maybe we'll get there one day, but I don't think
it is the first target.  We should however make sure not to do anything
that would make that harder such as adding ABI in the wrong places.

> 
> > Side question.  Does anyone care if /sys/bus/pci_express goes away?
> > - in theory it's ABI, but in practice it provides very little
> >   actual ABI.  
> 
> I would *love* to get rid of /sys/bus/pci_express, but I don't know
> what, if anything, would break if we removed it.

Could try it and see who screams ;) or fake it via symlinks or similar
(under a suitable 'legacy' config variable that is on by default.)

How about I try the following steps:
0) An experiment to make sure I can fake /sys/bus/pci_express so it's
   at least hard to tell there aren't real 'devices' registered on that
   bus. Even if we decide not to do that long term, we need to know
   we can if a regressions report comes in.
   Worst comes to the worst, we can register fake devices that fake
   drivers bind to that don't do anything beyond filling in sysfs.
1) Replace the bus/pci_express with direct functional calls in
   portdrv.  
   I'm thinking we have a few calls for each:
    bool aer_present(struct pci_dev *pci);
    aer_query_max_message_num() etc - used so we can do the msi/msix bit.
    aer_initialize()
    aer_finalize()
    aer_runtime_suspend() (where relevant for particular services)

   at the cost of a few less loops and a few more explicit calls
   that should simplify how things fit together.

2) Fix up anything in all that functionality that really cares
   that they are part of portdrv.  Not sure yet, but we may need
   a few additional things in struct pci_dev, or to pass in some
   state storage in the aer_initialize() call or similar.
3) Consider moving that functionality to a more appropriate place.
   At this point we'll have services that can be used by other
   drivers - though I'm not yet sure how useful that will be.
4) End up with a small pci_driver that binds to pci devices with
   the appropriate class code. Should even be able to make it
   a module.
5) (maybe do this first) carry my auxiliary_bus + cpmu stuff
   and allow adopting other similar cases alongside the other
   parts.

Of course, no statements on 'when' I'll make progress if this seems
like a step forwards, but probably sometime this summer. Maybe sooner.

Jonathan

> 
> Bjorn
Jonathan Cameron June 6, 2024, 12:57 p.m. UTC | #3
On Wed, 5 Jun 2024 20:44:28 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Wed, 5 Jun 2024 13:04:09 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> 
> > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote:  
> > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports +
> > > Switch USP and DSPs.
> > > 
> > > The final patch moving AER to the aux bus is in the category of 'why
> > > not try this' rather than something I feel is a particularly good idea.
> > > I would like to get opinions on the various ways to avoid the double bus
> > > situation introduced here. Some comments on other options for those existing
> > > 'pci_express' bus devices at the end of this cover letter.
> > > 
> > > The primary problem this RFC tries to solve is providing a means to
> > > register the CXL PMUs found in devices which will be bound to the
> > > PCIe portdrv. The proposed solution is to avoid the limitations of
> > > the existing pcie service driver approach by bolting on support
> > > for devices registered on the auxiliary_bus.
> > > 
> > > Background
> > > ==========
> > > 
> > > When the CXL PMU driver was added, only the instances found in CXL type 3
> > > memory devices (end points) were supported. These PMUs also occur in CXL
> > > root ports, and CXL switch upstream and downstream ports.  Adding
> > > support for these was deliberately left for future work because theses
> > > ports are all bound to the pcie portdrv via the appropriate PCI class
> > > code.  Whilst some CXL support of functionality on RP and Switch Ports
> > > is handled by the CXL port driver, that is not bound to the PCIe device,
> > > is only instantiated as part of enumerating endpoints, and cannot use
> > > interrupts because those are in control of the PCIe portdrv. A PMU with
> > > out interrupts would be unfortunate so a different solution is needed.
> > > 
> > > Requirements
> > > ============
> > > 
> > > - Register multiple CXL PMUs (CPMUs) per portdrv instance.    
> > 
> > What resources do CPMUs use?  BARs?  Config space registers in PCI
> > Capabilities?  Interrupts?  Are any of these resources
> > device-specific, or are they all defined by the CXL specs?  
> 
> Uses the whole lot (there isn't a toy out there that the CXL
> spec doesn't like to play with :). Discoverable via a CXL defined DVSEC
> in config space. Registers are in a BAR (discovered from the DVSEC).
> MSI/MSIX, number in a register in the BAR.
> 
> All the basic infrastructure and some performance event definitions
> are in the CXL spec.  It's all discoverable. 
> 
> The events are even tagged with VID so a vendor can implement
> someone else's definitions or those coming from another specification.
> A bunch of CXL VID tagged ones exist for protocol events etc.
> 
> It's a really nice general spec.  If I were more of a dreamer and had
> the time I'd try to get PCI-SIG to adopt it and we'd finally have
> something generic for PCI performance counting.
> 
> > 
> > The "device" is a nice way to package those up and manage ownership
> > since there are often interdependencies.
> > 
> > I wish portdrv was directly implemented as part of the PCI core,
> > without binding to the device as a "driver".  We handle some other
> > pieces of functionality that way, e.g., the PCIe Capability itself,
> > PM, VPD, MSI/MSI-X,   
> 
> Understood, though I would guess that even then there needs to
> be a pci_driver binding to the class code to actually query all those
> facilities and get interrupts registered etc.

Or are you thinking we can make something like the following work
(even when we can't do dynamic msix)?

The core brings up facilities and interrupts etc.  That includes bus master
so msi/msix can be issued (which makes me nervous - putting aside other
concerns, IIRC there are drivers where you have to be careful to
tweak registers to ensure you don't get a storm of traffic when you
hit bus master enable).

Specific driver may bind later - everything keeps running until 
the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
interrupts for core managed services, change the number of vectors and
then move them to wherever they end up before starting them again.
We have to do something similar in the unwind path.

I can see we might make something like that work, but I'd be very worried
we'd hit devices that didn't behave correctly under this flow even if
there aren't any specification caused problems.

Even if this is a possible long term plan, maybe we don't want to try and
get there in one hop?

Jonathan

> 
> >   
> > > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which
> > >   covers any PCI-Express port.
> > > - PCI MSI / MSIX message based interrupts must be registered by one driver -
> > >   in this case it's the portdrv.    
> > 
> > The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X
> > is a really annoying part of PCIe because bridges may have a BAR or
> > two and are allowed to have device-specific endpoint-like behavior, so
> > we may have to figure out how to share those interrupts between the
> > PCIe-architected stuff and the device-specific stuff.  I guess that's
> > messy no matter whether portdrv is its own separate driver or it's
> > integrated into the core.  
> 
> Yes, the interrupt stuff is the tricky bit.  I think whatever happens
> we end up with a pci device driver that binds to the class code.
> Maybe we have a way to switch in alternative drivers if that turns
> out to be necessary.
> 
> Trying to actually manage the interrupts in the core (without a driver binding)
> is really tricky.  Maybe we'll get there one day, but I don't think
> it is the first target.  We should however make sure not to do anything
> that would make that harder such as adding ABI in the wrong places.
> 
> >   
> > > Side question.  Does anyone care if /sys/bus/pci_express goes away?
> > > - in theory it's ABI, but in practice it provides very little
> > >   actual ABI.    
> > 
> > I would *love* to get rid of /sys/bus/pci_express, but I don't know
> > what, if anything, would break if we removed it.  
> 
> Could try it and see who screams ;) or fake it via symlinks or similar
> (under a suitable 'legacy' config variable that is on by default.)
> 
> How about I try the following steps:
> 0) An experiment to make sure I can fake /sys/bus/pci_express so it's
>    at least hard to tell there aren't real 'devices' registered on that
>    bus. Even if we decide not to do that long term, we need to know
>    we can if a regressions report comes in.
>    Worst comes to the worse we can register fake devices that fake
>    drivers bind to that don't do anything beyond fill in sysfs.
> 1) Replace the bus/pci_express with direct functional calls in
>    portdrv.  
>    I'm thinking we have a few calls for each:
>     bool aer_present(struct pci_dev *pci);
>     aer_query_max_message_num() etc - used so we can do the msi/msix bit.
>     aer_initialize()
>     aer_finalize()
>     aer_runtime_suspend() (where relevant for particular services)
> 
>    at the cost of a few less loops and a few more explicit calls
>    that should simplify how things fit together.
> 
> 2) Fix up anything in all that functionality that really cares
>    that they are part of portdrv.  Not sure yet, but we may need
>    a few additional things in struct pci_dev, or a to pass in some
>    state storage in the ae_initialize() call or similar.
> 3) Consider moving that functionality to a more appropriate place.
>    At this point we'll have services that can be used by other
>    drivers - though I'm not yet sure how useful that will be.
> 4) End up with a small pci_driver that binds to pci devices with
>    the appropriate class code. Should even be able to make it
>    a module.
> 5) (maybe do this first) carry my auxiliary_bus + cpmu stuff
>    and allow adopting other similar cases alongside the other
>    parts.
> 
> Of course, no statements on 'when' I'll make progress if this seems
> like a step forwards, but probably sometime this summer. Maybe sooner.
> 
> Jonathan
> 
> > 
> > Bjorn  
> 
>
Lukas Wunner June 6, 2024, 1:21 p.m. UTC | #4
On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote:
> Or are you thinking we can make something like the following work
> (even when we can't do dynamic msix)?
> 
> Core bring up facilities and interrupt etc.  That includes bus master
> so msi/msix can be issued (which makes me nervous - putting aside other
> concerns as IIRC there are drivers where you have to be careful to
> tweak registers to ensure you don't get a storm of traffic when you
> hit bus master enable.
> 
> Specific driver may bind later - everything keeps running until 
> the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
> interrupts for core managed services, change the number of vectors and
> then move them to wherever they end up before starting them again.
> We have to do the similar in the unwind path.

My recollection is that Thomas Gleixner has brought up dynamic MSI-X
allocation.  So if both the PCI core services as well as the driver
use MSI-X, there's no problem.

For MSI, one approach might be to reserve a certain number of vectors
in the core for later use by the driver.

Thanks,

Lukas
Ilpo Järvinen June 17, 2024, 7:03 a.m. UTC | #5
On Thu, 6 Jun 2024, Jonathan Cameron wrote:

> On Wed, 5 Jun 2024 20:44:28 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
> > On Wed, 5 Jun 2024 13:04:09 -0500
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > 
> > > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote:  
> > > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports +
> > > > Switch USP and DSPs.
> > > > 
> > > > The final patch moving AER to the aux bus is in the category of 'why
> > > > not try this' rather than something I feel is a particularly good idea.
> > > > I would like to get opinions on the various ways to avoid the double bus
> > > > situation introduced here. Some comments on other options for those existing
> > > > 'pci_express' bus devices at the end of this cover letter.
> > > > 
> > > > The primary problem this RFC tries to solve is providing a means to
> > > > register the CXL PMUs found in devices which will be bound to the
> > > > PCIe portdrv. The proposed solution is to avoid the limitations of
> > > > the existing pcie service driver approach by bolting on support
> > > > for devices registered on the auxiliary_bus.
> > > > 
> > > > Background
> > > > ==========
> > > > 
> > > > When the CXL PMU driver was added, only the instances found in CXL type 3
> > > > memory devices (end points) were supported. These PMUs also occur in CXL
> > > > root ports, and CXL switch upstream and downstream ports.  Adding
> > > > support for these was deliberately left for future work because theses
> > > > ports are all bound to the pcie portdrv via the appropriate PCI class
> > > > code.  Whilst some CXL support of functionality on RP and Switch Ports
> > > > is handled by the CXL port driver, that is not bound to the PCIe device,
> > > > is only instantiated as part of enumerating endpoints, and cannot use
> > > > interrupts because those are in control of the PCIe portdrv. A PMU with
> > > > out interrupts would be unfortunate so a different solution is needed.
> > > > 
> > > > Requirements
> > > > ============
> > > > 
> > > > - Register multiple CXL PMUs (CPMUs) per portdrv instance.    
> > > 
> > > What resources do CPMUs use?  BARs?  Config space registers in PCI
> > > Capabilities?  Interrupts?  Are any of these resources
> > > device-specific, or are they all defined by the CXL specs?  
> > 
> > Uses the whole lot (there isn't a toy out there that the CXL
> > spec doesn't like to play with :). Discoverable via a CXL defined DVSEC
> > in config space. Registers are in a BAR (discovered from the DVSEC).
> > MSI/MSIX, number in a register in the BAR.
> > 
> > All the basic infrastructure and some performance event definitions
> > are in the CXL spec.  It's all discoverable. 
> > 
> > The events are even tagged with VID so a vendor can implement
> > someone else's definitions or those coming from another specification.
> > A bunch of CXL VID tagged ones exist for protocol events etc.
> > 
> > It's a really nice general spec.  If I were more of a dreamer and had
> > the time I'd try to get PCI-SIG to adopt it and we'd finally have
> > something generic for PCI performance counting.
> > 
> > > 
> > > The "device" is a nice way to package those up and manage ownership
> > > since there are often interdependencies.
> > > 
> > > I wish portdrv was directly implemented as part of the PCI core,
> > > without binding to the device as a "driver".  We handle some other
> > > pieces of functionality that way, e.g., the PCIe Capability itself,
> > > PM, VPD, MSI/MSI-X,   
> > 
> > Understood, though I would guess that even then there needs to
> > be a pci_driver binding to the class code to actually query all those
> > facilities and get interrupts registered etc.
> 
> Or are you thinking we can make something like the following work
> (even when we can't do dynamic msix)?
> 
> Core bring up facilities and interrupt etc.  That includes bus master
> so msi/msix can be issued (which makes me nervous - putting aside other
> concerns as IIRC there are drivers where you have to be careful to
> tweak registers to ensure you don't get a storm of traffic when you
> hit bus master enable.
> 
> Specific driver may bind later - everything keeps running until 
> the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
> interrupts for core managed services, change the number of vectors and
> then move them to wherever they end up before starting them again.
> We have to do the similar in the unwind path.
> 
> I can see we might make something like that work, but I'd be very worried
> we'd hit devices that didn't behave correctly under this flow even if
> there aren't any specification caused problems.
> 
> Even if this is a possible long term plan, maybe we don't want to try and
> get there in one hop?

I've just a small bit of detail to add here (but I'm far from expert when 
it comes to the driver model, rpm, etc. areas so please keep that in mind
if something I say doesn't make sense).

One problematic thing with the current portdrv model and interrupts is that
portdrv resumes too late from hotplug's point of view.

When resuming, the PCI core tries to ensure the downstream port's bus and
the devices on it are in a usable state by waiting for the devices before
portdrv is resumed. For the hotplug case this is problematic because, if
devices were unplugged during suspend or are unplugged during that wait,
hotplug cannot notice the unplug before the wait times out, because hotplug
is not yet resumed.
Jonathan Cameron July 4, 2024, 4:14 p.m. UTC | #6
On Thu, 6 Jun 2024 13:57:56 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Wed, 5 Jun 2024 20:44:28 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
> > On Wed, 5 Jun 2024 13:04:09 -0500
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> >   
> > > On Wed, May 29, 2024 at 05:40:54PM +0100, Jonathan Cameron wrote:    
> > > > Focus of this RFC is CXL Performance Monitoring Units in CXL Root Ports +
> > > > Switch USP and DSPs.
> > > > 
> > > > The final patch moving AER to the aux bus is in the category of 'why
> > > > not try this' rather than something I feel is a particularly good idea.
> > > > I would like to get opinions on the various ways to avoid the double bus
> > > > situation introduced here. Some comments on other options for those existing
> > > > 'pci_express' bus devices at the end of this cover letter.
> > > > 
> > > > The primary problem this RFC tries to solve is providing a means to
> > > > register the CXL PMUs found in devices which will be bound to the
> > > > PCIe portdrv. The proposed solution is to avoid the limitations of
> > > > the existing pcie service driver approach by bolting on support
> > > > for devices registered on the auxiliary_bus.
> > > > 
> > > > Background
> > > > ==========
> > > > 
> > > > When the CXL PMU driver was added, only the instances found in CXL type 3
> > > > memory devices (end points) were supported. These PMUs also occur in CXL
> > > > root ports, and CXL switch upstream and downstream ports.  Adding
> > > > support for these was deliberately left for future work because theses
> > > > ports are all bound to the pcie portdrv via the appropriate PCI class
> > > > code.  Whilst some CXL support of functionality on RP and Switch Ports
> > > > is handled by the CXL port driver, that is not bound to the PCIe device,
> > > > is only instantiated as part of enumerating endpoints, and cannot use
> > > > interrupts because those are in control of the PCIe portdrv. A PMU with
> > > > out interrupts would be unfortunate so a different solution is needed.
> > > > 
> > > > Requirements
> > > > ============
> > > > 
> > > > - Register multiple CXL PMUs (CPMUs) per portdrv instance.      
> > > 
> > > What resources do CPMUs use?  BARs?  Config space registers in PCI
> > > Capabilities?  Interrupts?  Are any of these resources
> > > device-specific, or are they all defined by the CXL specs?    
> > 
> > Uses the whole lot (there isn't a toy out there that the CXL
> > spec doesn't like to play with :). Discoverable via a CXL defined DVSEC
> > in config space. Registers are in a BAR (discovered from the DVSEC).
> > MSI/MSIX, number in a register in the BAR.
> > 
> > All the basic infrastructure and some performance event definitions
> > are in the CXL spec.  It's all discoverable. 
> > 
> > The events are even tagged with VID so a vendor can implement
> > someone else's definitions or those coming from another specification.
> > A bunch of CXL VID tagged ones exist for protocol events etc.
> > 
> > It's a really nice general spec.  If I were more of a dreamer and had
> > the time I'd try to get PCI-SIG to adopt it and we'd finally have
> > something generic for PCI performance counting.
> >   
> > > 
> > > The "device" is a nice way to package those up and manage ownership
> > > since there are often interdependencies.
> > > 
> > > I wish portdrv was directly implemented as part of the PCI core,
> > > without binding to the device as a "driver".  We handle some other
> > > pieces of functionality that way, e.g., the PCIe Capability itself,
> > > PM, VPD, MSI/MSI-X,     
> > 
> > Understood, though I would guess that even then there needs to
> > be a pci_driver binding to the class code to actually query all those
> > facilities and get interrupts registered etc.  
> 
> Or are you thinking we can make something like the following work
> (even when we can't do dynamic msix)?
> 
> Core bring up facilities and interrupt etc.  That includes bus master
> so msi/msix can be issued (which makes me nervous - putting aside other
> concerns as IIRC there are drivers where you have to be careful to
> tweak registers to ensure you don't get a storm of traffic when you
> hit bus master enable.
> 
> Specific driver may bind later - everything keeps running until 
> the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
> interrupts for core managed services, change the number of vectors and
> then move them to wherever they end up before starting them again.
> We have to do the similar in the unwind path.
> 
> I can see we might make something like that work, but I'd be very worried
> we'd hit devices that didn't behave correctly under this flow even if
> there aren't any specification caused problems.
> 
> Even if this is a possible long term plan, maybe we don't want to try and
> get there in one hop?

In the spirit of 'conference driven' development, I'm planning to put in a
proposal for a discussion of portdrv issues at the LPC VFIO/IOMMU/PCI uconf,
having prototyped some options.  A key bit will be agreeing what the
actual requirements are and how we might meet them.

If it's not selected I'd still be keen to discuss whatever state things
are in come September - I'll just have a less busy summer ;)

Jonathan

> 
> Jonathan
> 
> >   
> > >     
> > > > - portdrv binds to the PCIe class code for PCI_CLASS_BRIDGE_PCI_NORMAL which
> > > >   covers any PCI-Express port.
> > > > - PCI MSI / MSIX message based interrupts must be registered by one driver -
> > > >   in this case it's the portdrv.      
> > > 
> > > The fact that AER/DPC/hotplug/etc (the portdrv services) use MSI/MSI-X
> > > is a really annoying part of PCIe because bridges may have a BAR or
> > > two and are allowed to have device-specific endpoint-like behavior, so
> > > we may have to figure out how to share those interrupts between the
> > > PCIe-architected stuff and the device-specific stuff.  I guess that's
> > > messy no matter whether portdrv is its own separate driver or it's
> > > integrated into the core.    
> > 
> > Yes, the interrupt stuff is the tricky bit.  I think whatever happens
> > we end up with a pci device driver that binds to the class code.
> > Maybe we have a way to switch in alternative drivers if that turns
> > out to be necessary.
> > 
> > Trying to actually manage the interrupts in the core (without a driver binding)
> > is really tricky.  Maybe we'll get there one day, but I don't think
> > it is the first target.  We should however make sure not to do anything
> > that would make that harder such as adding ABI in the wrong places.
> >   
> > >     
> > > > Side question.  Does anyone care if /sys/bus/pci_express goes away?
> > > > - in theory it's ABI, but in practice it provides very little
> > > >   actual ABI.      
> > > 
> > > I would *love* to get rid of /sys/bus/pci_express, but I don't know
> > > what, if anything, would break if we removed it.    
> > 
> > Could try it and see who screams ;) or fake it via symlinks or similar
> > (under a suitable 'legacy' config variable that is on by default.)
> > 
> > How about I try the following steps:
> > 0) An experiment to make sure I can fake /sys/bus/pci_express so it's
> >    at least hard to tell there aren't real 'devices' registered on that
> >    bus. Even if we decide not to do that long term, we need to know
> >    we can if a regressions report comes in.
> >    Worst comes to the worse we can register fake devices that fake
> >    drivers bind to that don't do anything beyond fill in sysfs.
> > 1) Replace the bus/pci_express with direct functional calls in
> >    portdrv.  
> >    I'm thinking we have a few calls for each:
> >     bool aer_present(struct pci_dev *pci);
> >     aer_query_max_message_num() etc - used so we can do the msi/msix bit.
> >     aer_initialize()
> >     aer_finalize()
> >     aer_runtime_suspend() (where relevant for particular services)
> > 
> >    at the cost of a few less loops and a few more explicit calls
> >    that should simplify how things fit together.
> > 
> > 2) Fix up anything in all that functionality that really cares
> >    that they are part of portdrv.  Not sure yet, but we may need
> >    a few additional things in struct pci_dev, or a to pass in some
> >    state storage in the ae_initialize() call or similar.
> > 3) Consider moving that functionality to a more appropriate place.
> >    At this point we'll have services that can be used by other
> >    drivers - though I'm not yet sure how useful that will be.
> > 4) End up with a small pci_driver that binds to pci devices with
> >    the appropriate class code. Should even be able to make it
> >    a module.
> > 5) (maybe do this first) carry my auxiliary_bus + cpmu stuff
> >    and allow adopting other similar cases alongside the other
> >    parts.
> > 
> > Of course, no statements on 'when' I'll make progress if this seems
> > like a step forwards, but probably sometime this summer. Maybe sooner.
> > 
> > Jonathan
> >   
> > > 
> > > Bjorn    
> > 
> >   
>
Jonathan Cameron Aug. 23, 2024, 11:05 a.m. UTC | #7
On Thu, 6 Jun 2024 15:21:02 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote:
> > Or are you thinking we can make something like the following work
> > (even when we can't do dynamic msix)?
> > 
> > Core bring up facilities and interrupt etc.  That includes bus master
> > so msi/msix can be issued (which makes me nervous - putting aside other
> > concerns as IIRC there are drivers where you have to be careful to
> > tweak registers to ensure you don't get a storm of traffic when you
> > hit bus master enable.
> > 
> > Specific driver may bind later - everything keeps running until 
> > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
> > interrupts for core managed services, change the number of vectors and
> > then move them to wherever they end up before starting them again.
> > We have to do the similar in the unwind path.  
> 
> My recollection is that Thomas Gleixner has brought up dynamic MSI-X
> allocation.  So if both the PCI core services as well as the driver
> use MSI-X, there's no problem.
> 
> For MSI, one approach might be to reserve a certain number of vectors
> in the core for later use by the driver.

So, my first attempt at doing things in the core ran into what I think
is probably a blocker. It's not entirely technical...

+CC Thomas who can probably very quickly confirm if my reasoning is
correct.

If we move these services into the PCI core itself (as opposed to keeping
a PCI portdrv layer), then we need to bring up MSI for a device before
we bind a driver to that struct pci_dev / struct device.

If we follow through
pci_alloc_irq_vectors_affinity()
-> __pci_enable_msix_range() / __pci_enable_msi_range()
-> pci_setup_msi_context()
-> msi_setup_device_data()
-> devres_add()
//there is actually devres usage before this in msi_sysfs_create_group()
  but this is a shorter path to illustrate the point.

We will have registered a dev_res callback on the struct pci_dev
that we later want to be able to bind a driver to.  That driver
won't bind - because lifetimes of devres will be garbage
(see really_probe() for the specific check and resulting
"Resources present before probing")

So we in theory 'could' provide a non devres path through
the MSI code, but I'm going to guess that Thomas will not
be particularly keen on that to support this single use case.

Thomas, can you confirm if this is something you'd consider
or outright reject? Or is there an existing route to have 
MSI/X working pre driver bind and still be able to bind
the driver later?

Hence, subject to Thomas' reply to the above, I think we are back
to having a PCI portdrv.

The options then become a case of tidying everything up and
supporting one (or a combination) of the following:

1) Make all functionality currently handled by portdrv
   available in easily consumed form.  Handle 'extended' drivers
   by unbinding the class driver and binding another one (if
   the more specific driver happened not to bind) - we could
   make this consistent by checking for class matches first
   so we always land the generic driver if multiple are ready
   to go.  This may be as simple as a
   'devm_init_pcie_portdrv_standard(int mymaxmsgnum);' in the probe
   of an extended driver.

2) Add a lightweight static path for adding additional 'services'
   to portdrv, similar to the current detection code that works out
   which message numbers are going to be needed.
3) Make portdrv support dynamic child drivers, including increasing the
   number of irq vectors if needed.
4) Do nothing at all and revisit next year (that 'worked' for
   last few years :)

Option 3 is most flexible but is it worth it?
- If we resize irq vectors, we will need to free all interrupts
  anyway and then reallocate them because they may move (maybe
  we can elide that if we are lucky for some devices).
- Will need to handle missed interrupts whilst they were disabled.
- Each 'child driver' will have to provide a 'detection' routine
  that will be called for all registered instances so that we can
  match against particular capabilities, stuff from BARs etc.
- Maybe we treat the 'normal' facilities from the PCI spec specially
  to avoid the interrupt enable/disable/enable dances except when
  there is something 'extra' present. That is more likely to hide
  odd bugs however.

I'm thinking option 3 is a case of over engineering something
better solved by 1 (or maybe 2).

There will be an LPC uconf session on this, but I'm going to
have some more time before LPC for exploration, so any inputs
now might help focus that effort. If not, see you in Vienna.

Jonathan
Thomas Gleixner Aug. 28, 2024, 9:11 p.m. UTC | #8
Jonathan!

On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote:
> On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@wunner.de> wrote:
>> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote:
>> > Or are you thinking we can make something like the following work
>> > (even when we can't do dynamic msix)?
>> > 
>> > Core bring up facilities and interrupt etc.  That includes bus master
>> > so msi/msix can be issued (which makes me nervous - putting aside other
>> > concerns as IIRC there are drivers where you have to be careful to
>> > tweak registers to ensure you don't get a storm of traffic when you
>> > hit bus master enable.

Correct. You have to be really careful about this.

>> > Specific driver may bind later - everything keeps running until 
>> > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
>> > interrupts for core managed services, change the number of vectors and
>> > then move them to wherever they end up before starting them again.
>> > We have to do the similar in the unwind path.  
>> 
>> My recollection is that Thomas Gleixner has brought up dynamic MSI-X
>> allocation.  So if both the PCI core services as well as the driver
>> use MSI-X, there's no problem.

Correct. If the core sets up N interrupts for core usage, then devices
can later on allocate MSI-X vectors on top if the underlying interrupt
domain hierarchy supports it. But see below.

>> For MSI, one approach might be to reserve a certain number of vectors
>> in the core for later use by the driver.

MSI is a problem in several aspects.

  1) You really have to know how many vectors you need at the end as you
     need to disable MSI in order to change the number of vectors in the
     config space. So if you do that after interrupts are already in use
     this can cause lost interrupts and stale devices. When MSI is
     disabled the interrupt might be routed to the legacy intX and aside
     of an eventual spurious interrupt message there is no trace of it.

  2) Allocating N vectors upfront and only using a small portion for the
     core and have the rest handy for drivers is problematic when the
     MSI implementation does not provide masking of the individual MSI
     interrupts. The driver probing might result in an interrupt
     storm for obvious reasons.

Aside of that if the core allocates N interrupts then all the resources
required (interrupt descriptors....CPU vectors) are allocated right
away. There is no reasonable way to do that post factum like we can do
with MSI-X. Any attempt to hack that into submission is futile and I
have zero interest in even looking at it.

The shortcomings of MSI are known for two decades and it's sufficiently
documented that it's a horrorshow for software to deal with it. Though
the 'save 5 gates and deal with it in software' camp still has not got
the memo. TBH, I don't care at all.

We can talk about MSI-X, but MSI is not going to gain support for any of
this. That's the only leverage we have to educate people who refuse to
accept reality.

> So, my first attempt at doing things in the core ran into what I think
> is probably a blocker. It's not entirely technical...
>
> +CC Thomas who can probably very quickly confirm if my reasoning is
> correct.
>
> If we move these services into the PCI core itself (as opposed keeping
> a PCI portdrv layer), then we need to bring up MSI for a device before
> we bind a driver to that struct pci_dev / struct device.
>
> If we follow through
> pci_alloc_irq_vectors_affinity()
> -> __pci_enable_msix_range() / __pci_enable_msi_range()
> -> pci_setup_msi_context()
> -> msi_setup_device_data()
> -> devres_add()
> //there is actually devres usage before this in msi_sysfs_create_group()
>   but this is a shorter path to illustrate the point.
>
> We will have registered a dev_res callback on the struct pci_dev
> that we later want to be able to bind a driver to.  That driver
> won't bind - because lifetimes of devres will be garbage
> (see really_probe() for the specific check and resulting
> "Resources present before probing")

That's because you are violating all layering rules. See below.

> So we in theory 'could' provide a non devres path through
> the MSI code, but I'm going to guess that Thomas will not
> be particularly keen on that to support this single use case.

Correct.

> Thomas, can you confirm if this is something you'd consider
> or outright reject? Or is there an existing route to have 
> MSI/X working pre driver bind and still be able to bind
> the driver later.

The initial enable has to set up at least one vector. That vector does
not have to be requested by anything, but that's it. After that you can
allocate with pci_msix_alloc_irq_at().
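
A minimal sketch of that split, assuming the device and parent domain support
dynamic MSI-X as reported by pci_msix_can_alloc_dyn(); this is illustrative
only (my_handler and the single core-owned vector are placeholders, not taken
from the patch set):

/* Core side: enabling MSI-X has to allocate at least one vector up front */
static int core_setup_irqs(struct pci_dev *pdev)
{
        return pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
}

/* Driver side, binding later: add a vector without disturbing the ones in use */
static int driver_add_irq(struct pci_dev *pdev)
{
        struct msi_map map;

        if (!pci_msix_can_alloc_dyn(pdev))
                return -EOPNOTSUPP;

        map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);
        if (map.index < 0)
                return map.index;

        /* my_handler: placeholder interrupt handler for the subdevice */
        return request_irq(map.virq, my_handler, 0, "subdevice", pdev);
}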

So looking at the ASCII art of the cover letter:

 _____________ __            ________ __________
|  Port       |  | creates  |        |          |
|  PCI Dev    |  |--------->| CPMU A |          |
|_____________|  |          |________|          |
|portdrv binds   |          | perf/cxlpmu binds |
|________________|          |___________________|
                 \         
                  \
                   \         ________ __________
                    \       |        |          |
                     ------>| CPMU B |          |    
                            |________|          |
                            | perf/cxlpmu binds |
                            |___________________|

AND

 _____________ __ 
|  Type 3     |  | creates                                 ________ _________
|  PCI Dev    |  |--------------------------------------->|        |         |
|_____________|  |                                        | CPMU C |         |
| cxl_pci binds  |                                        |________|         |
|________________|                                        | perf/cxlpmu binds|
                                                          |__________________|

If I understand it correctly then both the portdrv and the cxl_pci
drivers create a "bus". The CPMU devices are attached to these buses.

So we are talking about distinctly different devices with the twist that
these devices somehow need to utilize the MSI/X (forget MSI) facility of
the device which creates the bus.

From the devres perspective we look at separate devices and that's what
the interrupt code expects too. This reminds me of the lengthy
discussion we had about IMS a couple of years ago.

  https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/

My view on that issue was wrong because the Intel people described the
problem wrong. But the above pretty much begs for a proper separation
and hierarchical design because you provide an actual bus and distinct
devices. Reusing the ASCII art from that old thread for the second case,
but it's probably the same for the first one:

           ]-------------------------------------------|
           | PCI device                                |
           ]-------------------|                       |
           | Physical function |                       |
           ]-------------------|                       |
           ]-------------------|----------|            |
           | Control block for subdevices |            |
           ]------------------------------|            |
           |            | <- "Subdevice BUS"           |
           |            |                              |
           |            |-- Subddevice 0               | 
           |            |-- Subddevice 1               | 
           |            |-- ...                        | 
           |            |-- Subddevice N               | 
           ]-------------------------------------------|

1) cxl_pci driver binds to the PCI device.

2) cxl_pci driver creates AUXbus

3) Bus enumerates devices on AUXbus

4) Drivers bind to the AUXbus devices

So you have a clear provider consumer relationship. Whether the
subdevice utilizes resources of the PCI device or not is a hardware
implementation detail.

The important aspect is that you want to associate the subdevice
resources to the subdevice instances and not to the PCI device which
provides them.

Let me focus on interrupts, but it's probably the same for everything
else which is shared.

Look at how the PCI device manages interrupts with the per device MSI
mechanism. Forget doing this with either one of the legacy mechanisms.

  1) It creates a hierarchical interrupt domain and gets the required
     resources from the provided parent domain. The PCI side does not
     care whether this is x86 or arm64 and it neither cares whether the
     parent domain does remapping or not. The only way it cares is about
     the features supported by the different parent domains (think
     multi-MSI).
     
  2) Driver side allocations go through the per device domain

That AUXbus is not any different. When the CPMU driver binds it wants to
allocate interrupts. So instead of having a convoluted construct
reaching into the parent PCI device, you simply can do:

  1) Let the cxl_pci driver create a MSI parent domain and set that in
     the subdevice::msi::irqdomain pointer.

  2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
     driver to create a per device interrupt domain.

  3) Let the CPMU driver create its per device interrupt domain with
     the provided parent domain

  4) Let the CPMU driver allocate its MSI-X interrupts through the per
     device domain

Now on the cxl_pci side the AUXbus parent interrupt domain allocation
function does:

    if (!pci_msix_enabled(pdev))
    	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
    else
        return pci_msix_alloc_irq_at(pdev, ....);

The condition can be avoided if the cxl_pci driver enables MSI-X anyway
for its own purposes. Obviously you need the corresponding counterpart
for free(), but that's left as an exercise for the reader.

That still associates the subdevice specific MSI-X interrupts to the
underlying PCI device, but then you need to look at teardown. The
cxl_pci driver cannot go away before the CPMU drivers are shutdown.

The CPMU driver shutdown releases the interrupts through the interrupt
domain hierarchy, which removes them from the parent PCI device. Once
all CPMU drivers are gone, the shutdown of the cxl_pci driver gets rid
of the cxl_pci driver specific interrupts and everything is cleaned up.

This works out of the box on x86. The rest needs to gain support for
MSI_FLAG_PCI_MSIX_ALLOC_DYN. ARM64 and parts of ARM32 are already
converted over to the MSI parent domain concept, they just lack support
for dynamic allocation. The rest of the arch/ world needs some more
work. That's the problem of the architecture folks and that CXL auxbus
thing can simply tell them

      if (!pci_msix_can_alloc_dyn(pdev))
      		return -E_MAKE_YOUR_HOME_WORK;

and be done with it. Proliferating support for "two" legacy PCI/MSI
mechanisms by violating layering is not going to happen.

Neither am I interested in having creative workarounds for MSI as they are
going to create more problems than they solve, unless you come up with a
clean, simple and maintainable solution which works under all
circumstances. I'm not holding my breath...

I might not have all the crucial information as it happened in the
previous IMS discussion, but looking at your ASCII art makes me halfway
confident that I'm on the right track.

Thanks,

        tglx
Jonathan Cameron Aug. 29, 2024, 12:17 p.m. UTC | #9
On Wed, 28 Aug 2024 23:11:46 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Jonathan!

Hi Thomas,

Thanks for taking a look.  This has slightly suffered from being
a thread that diverged a long way from the original patch set.
So the question here wasn't really related to the cover letter :(
I think much of the discussion still applies however and some
of it answers questions I didn't even know I had.

Some new ascii art follows with the various options we are
considering (having ruled out the one this patch set covers).

The meat of the query was whether MSI or MSIX can be used prior
to binding a driver to struct device (pci_dev).  In theory
possible, but the infrastructure currently doesn't enable this.
The rest is details of the pain in actually using that if it
is possible.  You answered that I think so all good.

> 
> On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote:
> > On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> >> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote:  
> >> > Or are you thinking we can make something like the following work
> >> > (even when we can't do dynamic msix)?
> >> > 
> >> > Core bring up facilities and interrupt etc.  That includes bus master
> >> > so msi/msix can be issued (which makes me nervous - putting aside other
> >> > concerns as IIRC there are drivers where you have to be careful to
> >> > tweak registers to ensure you don't get a storm of traffic when you
> >> > hit bus master enable.  
> 
> Correct. You have to be really careful about this.
> 
> >> > Specific driver may bind later - everything keeps running until 
> >> > the specific driver calls pci_alloc_irq_vectors(), then we have to disable all
> >> > interrupts for core managed services, change the number of vectors and
> >> > then move them to wherever they end up before starting them again.
> >> > We have to do the similar in the unwind path.    
> >> 
> >> My recollection is that Thomas Gleixner has brought up dynamic MSI-X
> >> allocation.  So if both the PCI core services as well as the driver
> >> use MSI-X, there's no problem.  
> 
> Correct. If the core sets up N interrupts for core usage, then devices
> can later on allocate MSI-X vectors on top if the underlying interrupt
> domain hierarchy supports it. But see below.
> 
> >> For MSI, one approach might be to reserve a certain number of vectors
> >> in the core for later use by the driver.  
> 
> MSI is a problem in several aspects.
> 
>   1) You really have to know how many vectors you need at the end as you
>      need to disable MSI in order to change the number of vectors in the
>      config space. So if you do that after interrupts are already in use
>      this can cause lost interrupts and stale devices. When MSI is
>      disabled the interrupt might be routed to the legacy intX and aside
>      of an eventual spurious interrupt message there is no trace of it.

Ah. I'd not thought of the intX fallback. Ouch, so we'd definitely
need to do device side masking of all the sources before disabling MSI
and poll all the status registers to see if we missed anything.
That's not implausible to implement if fiddly.

> 
>   2) Allocating N vectors upfront and only using a small portion for the
>      core and have the rest handy for drivers is problematic when the
>      MSI implementation does not provide masking of the individual MSI
>      interrupts. The driver probing might result in an interrupt
>      storm for obvious reasons.

Ouch.  We 'should' be OK on that because the device should be well behaved
and not raise an interrupt for a feature that isn't enabled. If there are devices out
there that come up with an interrupt enabled by default (to which portdrv
attaches), then this problem would already have surfaced.
There might be some corners where a feature isn't enabled unless enough
vectors are requested but even those should have surfaced due to the
portdrv 'find what is used' dance where it first requests all the MSI
vectors, then checks what numbers are used (which aren't known until
MSI is enabled) and then releases all the vectors.
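
For reference, that dance is roughly the following (a simplified sketch from
memory of what pcie_port_enable_irq_vec() does today, not the literal
upstream code):

    #include <linux/minmax.h>
    #include <linux/pci.h>

    static int port_msgnum_dance(struct pci_dev *dev)
    {
            u32 pme_msg, aer_msg = 0, max_msg, reg32;
            u16 reg16, aer;
            int nvec;

            /* Request the maximum so the message numbers become valid;
               32 == PCIE_PORT_MAX_MSI_ENTRIES in drivers/pci/pcie/portdrv.h */
            nvec = pci_alloc_irq_vectors(dev, 1, 32,
                                         PCI_IRQ_MSIX | PCI_IRQ_MSI);
            if (nvec < 0)
                    return nvec;

            /* PME Interrupt Message Number: PCIe Capabilities register [13:9] */
            pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
            pme_msg = (reg16 & PCI_EXP_FLAGS_IRQ) >> 9;

            /* AER Interrupt Message Number: Root Error Status [31:27] */
            aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
            if (aer) {
                    pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, &reg32);
                    aer_msg = reg32 >> 27;
            }
            max_msg = max(pme_msg, aer_msg);

            /* Release the lot and reallocate exactly what is needed */
            pci_free_irq_vectors(dev);
            return pci_alloc_irq_vectors(dev, max_msg + 1, max_msg + 1,
                                         PCI_IRQ_MSIX | PCI_IRQ_MSI);
    }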

> 
> Aside of that if the core allocates N interrupts then all the resources
> required (interrupt descriptors....CPU vectors) are allocated right
> away. There is no reasonable way to do that post factum like we can do
> with MSI-X. Any attempt to hack that into submission is futile and I
> have zero interrest to even look at it.
> 
> The shortcomings of MSI are known for two decades and it's sufficiently
> documented that it's a horrorshow for software to deal with it. Though
> the 'save 5 gates and deal with it in software' camp still has not got
> the memo. TBH, I don't care at all.

The snag here is that the aim is to refactor a driver that has to keep handling
ancient implementations without breaking them.  We can enable new shiny stuff for
MSI-X only, but I don't want parallel infrastructure.

> 
> We can talk about MSI-X, but MSI is not going to gain support for any of
> this. That's the only leverage we have to educate people who refuse to
> accept reality.

Fully on board with the aim, but legacy is the pain point here.

> 
> > So, my first attempt at doing things in the core ran into what I think
> > is probably a blocker. It's not entirely technical...
> >
> > +CC Thomas who can probably very quickly confirm if my reasoning is
> > correct.
> >
> > If we move these services into the PCI core itself (as opposed keeping
> > a PCI portdrv layer), then we need to bring up MSI for a device before
> > we bind a driver to that struct pci_dev / struct device.
> >
> > If we follow through
> > pci_alloc_irq_vectors_affinity()  
> > -> __pci_enable_msix_range() / __pci_enable_msi_range()
> > -> pci_setup_msi_context()
> > -> msi_setup_device_data()
> > -> devres_add()  
> > //there is actually devres usage before this in msi_sysfs_create_group()
> >   but this is a shorter path to illustrate the point.
> >
> > We will have registered a dev_res callback on the struct pci_dev
> > that we later want to be able to bind a driver to.  That driver
> > won't bind - because lifetimes of devres will be garbage
> > (see really_probe() for the specific check and resulting
> > "Resources present before probing")  
> 
> That's because you are violating all layering rules. See below.

Agreed.  I never liked this approach. This was an exercise
in ruling it out.

> 
> > So we in theory 'could' provide a non devres path through
> > the MSI code, but I'm going to guess that Thomas will not
> > be particularly keen on that to support this single use case.  
> 
> Correct.

Excellent.  

> 
> > Thomas, can you confirm if this is something you'd consider
> > or outright reject? Or is there an existing route to have 
> > MSI/X working pre driver bind and still be able to bind
> > the driver later.  
> 
> The initial enable has to set up at least one vector. That vector does
> not have to be requested by anything, but that's it. After that you can
> allocate with pci_msix_alloc_irq_at().
> 
> So looking at the ASCII art of the cover letter:

This is focused on the CXL CPMU thing, which is where the patch set
was aimed, but not so much on the discussion of 'fixing' portdrv.

> 
>  _____________ __            ________ __________
> |  Port       |  | creates  |        |          |
> |  PCI Dev    |  |--------->| CPMU A |          |
> |_____________|  |          |________|          |
> |portdrv binds   |          | perf/cxlpmu binds |
> |________________|          |___________________|
>                  \         
>                   \
>                    \         ________ __________
>                     \       |        |          |
>                      ------>| CPMU B |          |      
>                             |________|          |
>                             | perf/cxlpmu binds |
>                             |___________________|
> 
> AND
> 
>  _____________ __ 
> |  Type 3     |  | creates                                 ________ _________
> |  PCI Dev    |  |--------------------------------------->|        |         |
> |_____________|  |                                        | CPMU C |         |
> | cxl_pci binds  |                                        |________|         |
> |________________|                                        | perf/cxpmu binds |
>                                                           |__________________|
> 
> If I understand it correctly then both the portdrv and the cxl_pci
> drivers create a "bus". The CPMU devices are attached to these buses.
> 
> So we are talking about distinctly different devices with the twist that
> these devices somehow need to utilize the MSI/X (forget MSI) facility of
> the device which creates the bus.

yes.

> 
> From the devres perspective we look at separate devices and that's what
> the interrupt code expects too. This reminds me of the lengthy
> discussion we had about IMS a couple of years ago.
> 
>   https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/
> 
> My view on that issue was wrong because the Intel people described the
> problem wrong. But the above pretty much begs for a proper separation
> and hierarchical design because you provide an actual bus and distinct
> devices. Reusing the ASCII art from that old thread for the second case,
> but it's probably the same for the first one:
> 
>            ]-------------------------------------------|
>            | PCI device                                |
>            ]-------------------|                       |
>            | Physical function |                       |
>            ]-------------------|                       |
>            ]-------------------|----------|            |
>            | Control block for subdevices |            |
>            ]------------------------------|            |
>            |            | <- "Subdevice BUS"           |
>            |            |                              |
>            |            |-- Subddevice 0               | 
>            |            |-- Subddevice 1               | 
>            |            |-- ...                        | 
>            |            |-- Subddevice N               | 
>            ]-------------------------------------------|
> 
> 1) cxl_pci driver binds to the PCI device.
> 
> 2) cxl_pci driver creates AUXbus

It's using a different CXL-specific bus, but I think the
discussion still holds and it's very auxbus-like.

> 
> 3) Bus enumerates devices on AUXbus
> 
> 4) Drivers bind to the AUXbus devices
> 
> So you have a clear provider consumer relationship. Whether the
> subdevice utilizes resources of the PCI device or not is a hardware
> implementation detail.
> 
> The important aspect is that you want to associate the subdevice
> resources to the subdevice instances and not to the PCI device which
> provides them.
> 
> Let me focus on interrupts, but it's probably the same for everything
> else which is shared.
> 
> Look at how the PCI device manages interrupts with the per device MSI
> mechanism. Forget doing this with either one of the legacy mechanisms.

Unfortunately legacy is a problem because the support for MSI is
already upstream.  So anything we do to cxl_pci and children
is an optimization only.

This is probably doable for the portdrv refactoring of extensions + hanging
the CXL PMU off that though, as we don't currently have upstream support
for that combination (so no chance of regressions).

Anyway I'll read this as talking about portdrv.c and the CPMU driver
binding to an auxbus device rather than cxl_pci.

> 
>   1) It creates a hierarchical interrupt domain and gets the required
>      resources from the provided parent domain. The PCI side does not
>      care whether this is x86 or arm64 and it neither cares whether the
>      parent domain does remapping or not. The only way it cares is about
>      the features supported by the different parent domains (think
>      multi-MSI).
>      
>   2) Driver side allocations go through the per device domain
> 
> That AUXbus is not any different. When the CPMU driver binds it wants to
> allocate interrupts. So instead of having a convoluted construct
> reaching into the parent PCI device, you simply can do:
> 
>   1) Let the cxl_pci driver create a MSI parent domain and set that in
>      the subdevice::msi::irqdomain pointer.
> 
>   2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
>      driver to create a per device interrupt domain.
> 
>   3) Let the CPMU driver create its per device interrupt domain with
>      the provided parent domain
> 
>   4) Let the CPMU driver allocate its MSI-X interrupts through the per
>      device domain
> 
> Now on the cxl_pci side the AUXbus parent interrupt domain allocation
> function does:
> 
>     if (!pci_msix_enabled(pdev))
>     	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
>     else
>         return pci_msix_alloc_irq_at(pdev, ....);
> 
> The condition can be avoided if the clx_pci driver enables MSI-X anyway
> for it's own purposes. Obviously you need the corresponding counterpart
> for free(), but that's left as an exercise for the reader.
> 
> That still associates the subdevice specific MSI-X interrups to the
> underlying PCI device, but then you need to look at teardown. The
> cxl_pci driver cannot go away before the CPMU drivers are shutdown.
>
> The CPMU driver shutdown releases the interrupts through the interrupt
> domain hierarchy, which removes them from the parent PCI device. Once
> all CPMU drivers are gone, the shutdown of the cxl_pci driver gets rid
> of the cxl_pci driver specific interrupts and everything is cleaned up.
> 
> This works out of the box on x86. The rest needs to gain support for
> MSI_FLAG_PCI_MSIX_ALLOC_DYN. ARM64 and parts of ARM32 are already
> converted over to the MSI parent domain concept, they just lack support
> for dynamic allocation. The rest of the arch/ world needs some more
> work. That's the problem of the architecture folks and that CXL auxbus
> thing can simply tell them
> 
>       if (!pci_msix_can_alloc_dyn(pdev))
>       		return -E_MAKE_YOUR_HOME_WORK;

Given my focus I need the arm64 case, but fine with this for others.

> 
> and be done with it. Proliferating support for "two" legacy PCI/MSI
> mechanisms by violating layering is not going to happen.
> 
> Neither I'm interrested to have creative workarounds for MSI as they are
> going to create more problems than they solve, unless you come up with a
> clean, simple and maintainable solution which works under all
> circumstances. I'm not holding my breath...

Nor am I.

> 
> I might not have all the crucial information as it happened in the
> previous IMS discussion, but looking at your ASCII art makes me halfways
> confident that I'm on the right track.

So, that background I mentioned on the options for portdrv.
Note that the CPMU bit is an extension to this.
Much as I'd love to break anything that isn't per-device / dynamic MSI-X, it is a
non-starter given the need to improve the existing driver.
However, enabling CPMU and other extensions only for dynamic MSI-X seems
fine to me.

Option 1. Tidy up today's solution only
---------------------------------------
Layering violations etc as you state above, but IRQ + MSI has to
work.

Port Drv                               PME                     AER 
--------                               ---                     ---
   |
0. Bind to pci_dev                
1. Set max MSI/MSIX
2. Read per subdriver msgnum            
3. Set to max msgnum seen             
4. Bus master enable etc.               
   |                                    
5. Register pci_express bus devices
   | (note this is a weird bus we'd all
   |  like to get rid of)
   |---------------------------------->|
   |                              6. Bind PME driver
   |                              7. Look up msgnum (again)
   |                              8. request_irq()
   |                              9. enable PME int (device side)
   |                                   | 
   |                                   |          
   |---------------------------------------------------> 6a. Bind AER driver
   |                                   |                 7a. Look up msgnum(again)
   |                                   |                 8a. request_irq()
   |                                   |                 9a. handle AER int (device side)
   |                                   |                        |   
10. unbind parent.                     |                        |
11. unregister children,               |                        |
   |-----------------------------12. PME remove() does...       |
   |                             13. disable int (device side)  |
   |                             14. free_irq() and cancel work |
   |                                                            |
   |--------------------------------------------------->12a. AER remove() does...
   |                                                    13a. Disable AER int (dev side)
   |                                                    14a. free irq cancel work etc.
   |
15. Finish portdrv remove()
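
For context, one of today's pci_express bus children looks roughly like
this (a from-memory sketch of the drivers/pci/pcie/portdrv.h interface,
handler bodies trimmed, so only buildable inside drivers/pci/pcie/):

    static irqreturn_t pme_irq(int irq, void *context)
    {
            /* Ack the event and kick off the work item... */
            return IRQ_HANDLED;
    }

    static int pme_probe(struct pcie_device *srv)
    {
            /* srv->irq was assigned by portdrv during the msgnum dance */
            return request_irq(srv->irq, pme_irq, 0, "pcie_pme", srv);
    }

    static void pme_remove(struct pcie_device *srv)
    {
            free_irq(srv->irq, srv);
    }

    static struct pcie_port_service_driver pme_service_driver = {
            .name      = "pcie_pme",
            .port_type = PCIE_ANY_PORT,
            .service   = PCIE_PORT_SERVICE_PME,
            .probe     = pme_probe,
            .remove    = pme_remove,
    };

    /* Registered at init via pcie_port_service_register(&pme_service_driver) */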

- New functions are added via the same framework, so they register on the
  pci_express bus or via an auxbus (the auxbus was the focus of this series).
  If we are switching to a different bus, we can do the approach you suggest
  and dynamic MSI-X only.
  - It would be odd to do this for the pci_express bus given we need the
    layering breakage anyway to keep MSI working.

(I think option 3 is better as it gets rid of existing layering issues
 that you called out).

Option 2:  What provoked this bit of the thread (but not the original thread which was
           option 1).

(This doesn't work - I think)

The PCI core would have to work without dynamic MSI-X as legacy, but
we could do what you suggested for the 'extension' part and only
succeed in Extended PCI Port driver probing if that feature is
available (and safe on the device).

The aim here was that 'core' PCI defined services shouldn't need a driver bound.
This is true for some already.

PCI core                                                Extended PCI Port driver.
   |                                                    This only handles non PCI spec stuff.
   |
 1. Enumerate device
 2. Do interrupt discovery as
    previously done in portdrv
 3. Enable irq / MSI / MSIX as appropriate for
    AER, PME, DPC etc but not extended features.
 (note this would require binding to the device,
  which doesn't have a driver bound yet).
 4. Provide all the services originally in portdrv
   |
   |
 5. Driver bind happens (fails because of note above).
   |------------------------------------------------------------|
   |                                                      6. request MSI/MSIX vectors.
   |<-----------------------------------------------------------|
6. Quiesce AER PME Etc.                                         .
7. Unbind all irqs (they may move anyway)                       .
8. Release msi/msix vectors etc.                                .
9. Request expanded set of MSI/MSIX                             .
10. request all AER/PME etc irqs again.                         .
11. Check for missed interrupts (status register reads)         .
    and handle if we did.                                       . 
  |------------------------------------------------------------>|
  |                                                         Carry on as normal.

With your notes on dynamic MSIx and domains, this could be simplified
but falls down on the fact we can't handle stuff pre driver bind.
We could do a layered driver with domains and maybe hide that in the PCI
core, but why bother. (Option dead because you quite rightly are
not interested in more layering violations to solve that devres
issue).

Option 3: Move to flatter design for existing portdrv (+ children)
    The children (standard features) become library function calls
    Extensibility solved in 3a and 3b below.

Port Drv
   |
1. Set max MSI/MSIX
2. Read per subdriver msgnum
3. Set to max msgnum seen
4. Bus master enable etc.
5. PME setup
      Call pme_init(pci_dev);
      Request irq etc
6. AER setup
      Call aer_rp_init(pci_dev);
      Request irq etc.
   |
   |
   |
7. unbind parent.
8. remove calls pme_exit(pci_dev)
9. remove calls aer_exit(pci_dev)
etc
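
i.e. the probe path becomes something like this (sketch only: portdrv_setup_irqs()
stands in for steps 1-4 above, and pme_init()/aer_rp_init()/..._exit() are the
hypothetical library entry points from the flow):

    static int pcie_portdrv_probe(struct pci_dev *pdev,
                                  const struct pci_device_id *id)
    {
            int ret;

            ret = portdrv_setup_irqs(pdev);         /* steps 1-4 */
            if (ret)
                    return ret;

            ret = pme_init(pdev);                   /* was a child driver */
            if (ret)
                    goto err_free_irqs;

            ret = aer_rp_init(pdev);                /* was a child driver */
            if (ret)
                    goto err_pme_exit;

            return 0;

    err_pme_exit:
            pme_exit(pdev);
    err_free_irqs:
            pci_free_irq_vectors(pdev);
            return ret;
    }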

So how to make this extensible
------------------------------
Option 3a:
- Augment main driver - possibly only if MSIX and
  dynamic allocation supported.

Option 3b:
- Leave main driver alone, extension requires binding
  a specific driver. Advantage is this is simple and
  that we don't end up with an ever increasing amount of
  'feature' discovery code in a single driver.

Note that this choice has nothing to do with the interrupt bits
but is more about feature discovery.

Either way, structure looks like above for PCI spec
defined 'normal' services + extra.

Port Drv                                                CXL PMU driver
   |                                                         |
1. Set max MSI/MSIX
2. Read per subdriver msgnum
3. Set to max msgnum seen
   If handling interrupts other than dynamic msix
   then we need to find child driver interrupts here
   too. Note that the Portdrv has to have specific code
   to enumerate the 'features' for which child devices
   are created anyway, so this isn't much worse.

4. Bus master enable etc.
5. PME setup
      Call pme_init(pci_dev);
      Request irq etc
6. AER setup
      Call aer_rp_init(pci_dev);
      Request irq etc.
   |
7. Create child devices
   |------------------------------------------------->Non dynamic msix
                                                      8. Request IRQ.
                                                      9. Normal driver operation.
                                                      Dynamic msix
                                                      8. Setup per device MSI Domain
                                                      9. request msix vectors
                                                         as you describe above.
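
Child device creation itself would be the usual auxiliary bus pattern,
roughly like the following (a sketch, assuming the CPMU children are
registered as auxiliary devices as in this series):

    #include <linux/auxiliary_bus.h>
    #include <linux/pci.h>
    #include <linux/slab.h>

    static void cpmu_adev_release(struct device *dev)
    {
            kfree(container_of(dev, struct auxiliary_device, dev));
    }

    static int add_cpmu_child(struct pci_dev *pdev, int idx)
    {
            struct auxiliary_device *adev;
            int ret;

            adev = kzalloc(sizeof(*adev), GFP_KERNEL);
            if (!adev)
                    return -ENOMEM;

            adev->name = "cpmu";
            adev->id = idx;
            adev->dev.parent = &pdev->dev;
            adev->dev.release = cpmu_adev_release;

            ret = auxiliary_device_init(adev);
            if (ret) {
                    kfree(adev);
                    return ret;
            }

            ret = auxiliary_device_add(adev);
            if (ret)
                    auxiliary_device_uninit(adev);
            return ret;
    }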


So currently I'm thinking option 3 and, with your inputs above,
only supporting CXL PMU and other extended features with dynamic MSI-X.
Hopefully anyone who has built an MSI only CXL switch with a CPMU hasn't
read this far so won't object :)  Congratulations to anyone who
gets this far.

My current plan is to do:
1. Squash the various existing pci_express child drivers into the
   portdrv as simple library calls.
2. If using MSI-X: add interrupt domain / MSI-X stuff as you suggest + CPMU
   child as example.

Whether we end up with option 3a or 3b can be figured out later.

Thanks again for very useful feedback.

Jonathan

   

> 
> Thanks,
> 
>         tglx
Jonathan Cameron Sept. 5, 2024, 11:23 a.m. UTC | #10
Hi Thomas,

One quick follow up question below.

> 
> So looking at the ASCII art of the cover letter:
> 
>  _____________ __            ________ __________
> |  Port       |  | creates  |        |          |
> |  PCI Dev    |  |--------->| CPMU A |          |
> |_____________|  |          |________|          |
> |portdrv binds   |          | perf/cxlpmu binds |
> |________________|          |___________________|
>                  \         
>                   \
>                    \         ________ __________
>                     \       |        |          |
>                      ------>| CPMU B |          |      
>                             |________|          |
>                             | perf/cxlpmu binds |
>                             |___________________|
> 
> AND
> 
>  _____________ __ 
> |  Type 3     |  | creates                                 ________ _________
> |  PCI Dev    |  |--------------------------------------->|        |         |
> |_____________|  |                                        | CPMU C |         |
> | cxl_pci binds  |                                        |________|         |
> |________________|                                        | perf/cxpmu binds |
>                                                           |__________________|
> 
> If I understand it correctly then both the portdrv and the cxl_pci
> drivers create a "bus". The CPMU devices are attached to these buses.
> 
> So we are talking about distinctly different devices with the twist that
> these devices somehow need to utilize the MSI/X (forget MSI) facility of
> the device which creates the bus.
> 
> From the devres perspective we look at separate devices and that's what
> the interrupt code expects too. This reminds me of the lengthy
> discussion we had about IMS a couple of years ago.
> 
>   https://lore.kernel.org/all/87bljg7u4f.fsf@nanos.tec.linutronix.de/
> 
> My view on that issue was wrong because the Intel people described the
> problem wrong. But the above pretty much begs for a proper separation
> and hierarchical design because you provide an actual bus and distinct
> devices. Reusing the ASCII art from that old thread for the second case,
> but it's probably the same for the first one:
> 
>            ]-------------------------------------------|
>            | PCI device                                |
>            ]-------------------|                       |
>            | Physical function |                       |
>            ]-------------------|                       |
>            ]-------------------|----------|            |
>            | Control block for subdevices |            |
>            ]------------------------------|            |
>            |            | <- "Subdevice BUS"           |
>            |            |                              |
>            |            |-- Subddevice 0               | 
>            |            |-- Subddevice 1               | 
>            |            |-- ...                        | 
>            |            |-- Subddevice N               | 
>            ]-------------------------------------------|
> 
> 1) cxl_pci driver binds to the PCI device.
> 
> 2) cxl_pci driver creates AUXbus
> 
> 3) Bus enumerates devices on AUXbus
> 
> 4) Drivers bind to the AUXbus devices
> 
> So you have a clear provider consumer relationship. Whether the
> subdevice utilizes resources of the PCI device or not is a hardware
> implementation detail.
> 
> The important aspect is that you want to associate the subdevice
> resources to the subdevice instances and not to the PCI device which
> provides them.
> 
> Let me focus on interrupts, but it's probably the same for everything
> else which is shared.
> 
> Look at how the PCI device manages interrupts with the per device MSI
> mechanism. Forget doing this with either one of the legacy mechanisms.
> 
>   1) It creates a hierarchical interrupt domain and gets the required
>      resources from the provided parent domain. The PCI side does not
>      care whether this is x86 or arm64 and it neither cares whether the
>      parent domain does remapping or not. The only way it cares is about
>      the features supported by the different parent domains (think
>      multi-MSI).
>      
>   2) Driver side allocations go through the per device domain
> 
> That AUXbus is not any different. When the CPMU driver binds it wants to
> allocate interrupts. So instead of having a convoluted construct
> reaching into the parent PCI device, you simply can do:
> 
>   1) Let the cxl_pci driver create a MSI parent domain and set that in
>      the subdevice::msi::irqdomain pointer.
> 
>   2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
>      driver to create a per device interrupt domain.
> 
>   3) Let the CPMU driver create its per device interrupt domain with
>      the provided parent domain
> 
>   4) Let the CPMU driver allocate its MSI-X interrupts through the per
>      device domain
> 
> Now on the cxl_pci side the AUXbus parent interrupt domain allocation
> function does:
> 
>     if (!pci_msix_enabled(pdev))
>     	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
>     else
>         return pci_msix_alloc_irq_at(pdev, ....);

Hi Thomas,

I'm struggling to follow this suggestion.
Would you expect the cxl_pci MSI parent domain to set its parent as
msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev),
					 IRQ_DOMAIN_FLAG_MSI_PARENT,
					 ...
which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs()
or create a separate domain structure and provide callbacks that reach into
the parent domain as necessary?

Or do I have this entirely wrong? I'm struggling to relate what existing
code like PCI does to what I need to do here.

Jonathan
Thomas Gleixner Sept. 6, 2024, 10:11 a.m. UTC | #11
On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote:
>> Look at how the PCI device manages interrupts with the per device MSI
>> mechanism. Forget doing this with either one of the legacy mechanisms.
>> 
>>   1) It creates a hierarchical interrupt domain and gets the required
>>      resources from the provided parent domain. The PCI side does not
>>      care whether this is x86 or arm64 and it neither cares whether the
>>      parent domain does remapping or not. The only way it cares is about
>>      the features supported by the different parent domains (think
>>      multi-MSI).
>>      
>>   2) Driver side allocations go through the per device domain
>> 
>> That AUXbus is not any different. When the CPMU driver binds it wants to
>> allocate interrupts. So instead of having a convoluted construct
>> reaching into the parent PCI device, you simply can do:
>> 
>>   1) Let the cxl_pci driver create a MSI parent domain and set that in
>>      the subdevice::msi::irqdomain pointer.
>> 
>>   2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
>>      driver to create a per device interrupt domain.
>> 
>>   3) Let the CPMU driver create its per device interrupt domain with
>>      the provided parent domain
>> 
>>   4) Let the CPMU driver allocate its MSI-X interrupts through the per
>>      device domain
>> 
>> Now on the cxl_pci side the AUXbus parent interrupt domain allocation
>> function does:
>> 
>>     if (!pci_msix_enabled(pdev))
>>     	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
>>     else
>>         return pci_msix_alloc_irq_at(pdev, ....);

Ignore the above. Brainfart.

> I'm struggling to follow this suggestion
> Would you expect the cxl_pci MSI parent domain to set it's parent as
> msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev),
> 					 IRQ_DOMAIN_FLAG_MSI_PARENT,
> 					 ...
> which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs()
> or create a separate domain structure and provide callbacks that reach into
> the parent domain as necessary?
>
> Or do I have this entirely wrong? I'm struggling to relate what existing
> code like PCI does to what I need to do here.

dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different
models we have:

   - Legacy has no domain

   - Non-hierarchical domain

   - Hierarchical domain v1

         That's the global PCI/MSI domain

   - Hierarchical domain v2

      That's the underlying MSI_PARENT domain, which is on x86
      either the interrupt remap unit or the vector domain. On arm64
      it's the ITS domain.

See e.g. pci_msi_domain_supports() which handles the mess.

Now this proposal will only work on hierarchical domain v2 because that
can do the post enable allocations on MSI-X. Let's put a potential
solution for MSI aside for now to avoid complexity explosion.

So there are two ways to solve this:

  1) Implement the CXL MSI parent domain as disjunct domain

  2) Teach the V2 per device MSI-X domain to be a parent domain

#1 Looks pretty straightforward, but does not seamlessly fit the
   hierarchical domain model and creates horrible indirections.

#2 Is the proper abstraction, but requires to teach the MSI core code
   about stacked MSI parent domains, which should not be horribly hard
   or maybe just works out of the box.

The PCI device driver invokes the not yet existing function
pci_enable_msi_parent_domain(pci_dev). This function does:

  A) validate that this PCI device has a V2 parent domain

  B) validate that the device has enabled MSI-X

  C) validate that the PCI/MSI-X domain supports dynamic MSI-X
     allocation

  D) if #A to #C are true, it enables the PCI device parent domain

I made #B a prerequisite for now, as that's an orthogonal problem, which
does not need to be solved upfront. Maybe it does not need to be solved
at all because the underlying PCI driver always allocates a management
interrupt before dealing with the underlying "bus", which is IMHO a
reasonable expectation. At least it's a reasonable restriction for
getting started.

That function does:

     msix_dom = pci_msi_get_msix_domain(pdev);
     msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
     msix_dom->msi_parent_ops = &pci_msix_parent_ops;

When creating the CXL devices the CXL code invokes
pci_msi_init_subdevice(pdev, &cxldev->device), which just does:

  dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev));

That allows the CXL devices to create their own per device MSI
domain via a new function pci_msi_create_subdevice_irq_domain().

That function can just use a variant of pci_create_device_domain() with
a different domain template and a different irqchip, where the irqchip
just redirects to the underlying parent domain chip, aka PCI-MSI-X.

I don't think the CXL device will require a special chip as all they
should need to know is the linux interrupt number. If there are special
requirements then we can bring the IMS code back or do something similar
to the platform MSI code. 

Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free()
which are the only two functions which the CXL (or similar) need.

The existing pci_msi*() API functions might need a safety check so they
can't be abused by subdevices, but that's a problem for a final
solution.
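
IOW, usage would boil down to something like this (all of these are the
not-yet-existing functions from above, signatures guessed, cxldev/cpmu and
the vector count are made up, declarations omitted, so purely a sketch):

    /* portdrv / cxl_pci side, after MSI-X has been enabled: */
    ret = pci_enable_msi_parent_domain(pdev);

    /* ...and when creating a subdevice: */
    pci_msi_init_subdevice(pdev, &cxldev->device);

    /* CPMU (subdevice) driver side, in probe(): */
    if (!pci_msi_create_subdevice_irq_domain(&cxldev->device, CPMU_MAX_VECS))
            return -ENODEV;

    map = pci_subdev_msix_alloc_at(&cxldev->device, MSI_ANY_INDEX, NULL);
    if (map.index < 0)
            return map.index;

    ret = request_irq(map.virq, cpmu_irq, 0, "cpmu", cpmu);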

That's pretty much all what it takes. Hope that helps.

Thanks,

        tglx
Jonathan Cameron Sept. 6, 2024, 5:18 p.m. UTC | #12
On Fri, 06 Sep 2024 12:11:36 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote:
> >> Look at how the PCI device manages interrupts with the per device MSI
> >> mechanism. Forget doing this with either one of the legacy mechanisms.
> >> 
> >>   1) It creates a hierarchical interrupt domain and gets the required
> >>      resources from the provided parent domain. The PCI side does not
> >>      care whether this is x86 or arm64 and it neither cares whether the
> >>      parent domain does remapping or not. The only way it cares is about
> >>      the features supported by the different parent domains (think
> >>      multi-MSI).
> >>      
> >>   2) Driver side allocations go through the per device domain
> >> 
> >> That AUXbus is not any different. When the CPMU driver binds it wants to
> >> allocate interrupts. So instead of having a convoluted construct
> >> reaching into the parent PCI device, you simply can do:
> >> 
> >>   1) Let the cxl_pci driver create a MSI parent domain and set that in
> >>      the subdevice::msi::irqdomain pointer.
> >> 
> >>   2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
> >>      driver to create a per device interrupt domain.
> >> 
> >>   3) Let the CPMU driver create its per device interrupt domain with
> >>      the provided parent domain
> >> 
> >>   4) Let the CPMU driver allocate its MSI-X interrupts through the per
> >>      device domain
> >> 
> >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation
> >> function does:
> >> 
> >>     if (!pci_msix_enabled(pdev))
> >>     	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
> >>     else
> >>         return pci_msix_alloc_irq_at(pdev, ....);  
> 
> Ignore the above. Brainfart.
> 
> > I'm struggling to follow this suggestion
> > Would you expect the cxl_pci MSI parent domain to set it's parent as
> > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev),
> > 					 IRQ_DOMAIN_FLAG_MSI_PARENT,
> > 					 ...
> > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs()
> > or create a separate domain structure and provide callbacks that reach into
> > the parent domain as necessary?
> >
> > Or do I have this entirely wrong? I'm struggling to relate what existing
> > code like PCI does to what I need to do here.  
> 
> dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different
> models we have:
> 
>    - Legacy has no domain
> 
>    - Non-hierarchical domain
> 
>    - Hierarchical domain v1
> 
>          That's the global PCI/MSI domain
> 
>    - Hierarchical domain v2
> 
>       That's the underlying MSI_PARENT domain, which is on x86
>       either the interrupt remap unit or the vector domain. On arm64
>       it's the ITS domain.
> 
> See e.g. pci_msi_domain_supports() which handles the mess.
> 
> Now this proposal will only work on hierarchical domain v2 because that
> can do the post enable allocations on MSI-X. Let's put a potential
> solution for MSI aside for now to avoid complexity explosion.
> 
> So there are two ways to solve this:
> 
>   1) Implement the CXL MSI parent domain as disjunct domain
> 
>   2) Teach the V2 per device MSI-X domain to be a parent domain
> 
> #1 Looks pretty straight forward, but does not seemlessly fit the
>    hierarchical domain model and creates horrible indirections.
> 
> #2 Is the proper abstraction, but requires to teach the MSI core code
>    about stacked MSI parent domains, which should not be horribly hard
>    or maybe just works out of the box.
> 
> The PCI device driver invokes the not yet existing function
> pci_enable_msi_parent_domain(pci_dev). This function does:
> 
>   A) validate that this PCI device has a V2 parent domain
> 
>   B) validate that the device has enabled MSI-X
> 
>   C) validate that the PCI/MSI-X domain supports dynamic MSI-X
>      allocation
> 
>   D) if #A to #C are true, it enables the PCI device parent domain
> 
> I made #B a prerequisite for now, as that's an orthogonal problem, which
> does not need to be solved upfront. Maybe it does not need to be solved
> at all because the underlying PCI driver always allocates a management
> interrupt before dealing with the underlying "bus", which is IMHO a
> reasonable expectation. At least it's a reasonable restriction for
> getting started.
> 
> That function does:
> 
>      msix_dom = pci_msi_get_msix_domain(pdev);
>      msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
>      msix_dom->msi_parent_ops = &pci_msix_parent_ops;
> 
> When creating the CXL devices the CXL code invokes
> pci_msi_init_subdevice(pdev, &cxldev->device), which just does:
> 
>   dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev));
> 
> That allows the CXL devices to create their own per device MSI
> domain via a new function pci_msi_create_subdevice_irq_domain().
> 
> That function can just use a variant of pci_create_device_domain() with
> a different domain template and a different irqchip, where the irqchip
> just redirects to the underlying parent domain chip, aka PCI-MSI-X.
> 
> I don't think the CXL device will require a special chip as all they
> should need to know is the linux interrupt number. If there are special
> requirements then we can bring the IMS code back or do something similar
> to the platform MSI code. 
> 
> Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free()
> which are the only two functions which the CXL (or similar) need.
> 
> The existing pci_msi*() API function might need a safety check so they
> can't be abused by subdevices, but that's a problem for a final
> solution.
> 
> That's pretty much all what it takes. Hope that helps.

Absolutely!  Thanks for providing the detailed suggestion;
this definitely smells a lot less nasty than the previous approach.

I have things sort of working now but it's ugly code with a few
cross-layer hacks that need tidying up (around programming the
msi registers from the wrong 'device'), so it may be a little
while before I get it in a state to post.

Jonathan

> 
> Thanks,
> 
>         tglx
> 
> 
>
Jonathan Cameron Sept. 10, 2024, 4:47 p.m. UTC | #13
On Fri, 6 Sep 2024 18:18:32 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Fri, 06 Sep 2024 12:11:36 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote:  
> > >> Look at how the PCI device manages interrupts with the per device MSI
> > >> mechanism. Forget doing this with either one of the legacy mechanisms.
> > >> 
> > >>   1) It creates a hierarchical interrupt domain and gets the required
> > >>      resources from the provided parent domain. The PCI side does not
> > >>      care whether this is x86 or arm64 and it neither cares whether the
> > >>      parent domain does remapping or not. The only way it cares is about
> > >>      the features supported by the different parent domains (think
> > >>      multi-MSI).
> > >>      
> > >>   2) Driver side allocations go through the per device domain
> > >> 
> > >> That AUXbus is not any different. When the CPMU driver binds it wants to
> > >> allocate interrupts. So instead of having a convoluted construct
> > >> reaching into the parent PCI device, you simply can do:
> > >> 
> > >>   1) Let the cxl_pci driver create a MSI parent domain and set that in
> > >>      the subdevice::msi::irqdomain pointer.
> > >> 
> > >>   2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
> > >>      driver to create a per device interrupt domain.
> > >> 
> > >>   3) Let the CPMU driver create its per device interrupt domain with
> > >>      the provided parent domain
> > >> 
> > >>   4) Let the CPMU driver allocate its MSI-X interrupts through the per
> > >>      device domain
> > >> 
> > >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation
> > >> function does:
> > >> 
> > >>     if (!pci_msix_enabled(pdev))
> > >>     	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
> > >>     else
> > >>         return pci_msix_alloc_irq_at(pdev, ....);    
> > 
> > Ignore the above. Brainfart.
> >   
> > > I'm struggling to follow this suggestion
> > > Would you expect the cxl_pci MSI parent domain to set it's parent as
> > > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev),
> > > 					 IRQ_DOMAIN_FLAG_MSI_PARENT,
> > > 					 ...
> > > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs()
> > > or create a separate domain structure and provide callbacks that reach into
> > > the parent domain as necessary?
> > >
> > > Or do I have this entirely wrong? I'm struggling to relate what existing
> > > code like PCI does to what I need to do here.    
> > 
> > dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different
> > models we have:
> > 
> >    - Legacy has no domain
> > 
> >    - Non-hierarchical domain
> > 
> >    - Hierarchical domain v1
> > 
> >          That's the global PCI/MSI domain
> > 
> >    - Hierarchical domain v2
> > 
> >       That's the underlying MSI_PARENT domain, which is on x86
> >       either the interrupt remap unit or the vector domain. On arm64
> >       it's the ITS domain.
> > 
> > See e.g. pci_msi_domain_supports() which handles the mess.
> > 
> > Now this proposal will only work on hierarchical domain v2 because that
> > can do the post enable allocations on MSI-X. Let's put a potential
> > solution for MSI aside for now to avoid complexity explosion.
> > 
> > So there are two ways to solve this:
> > 
> >   1) Implement the CXL MSI parent domain as disjunct domain
> > 
> >   2) Teach the V2 per device MSI-X domain to be a parent domain
> > 
> > #1 Looks pretty straight forward, but does not seemlessly fit the
> >    hierarchical domain model and creates horrible indirections.
> > 
> > #2 Is the proper abstraction, but requires to teach the MSI core code
> >    about stacked MSI parent domains, which should not be horribly hard
> >    or maybe just works out of the box.
> > 
> > The PCI device driver invokes the not yet existing function
> > pci_enable_msi_parent_domain(pci_dev). This function does:
> > 
> >   A) validate that this PCI device has a V2 parent domain
> > 
> >   B) validate that the device has enabled MSI-X
> > 
> >   C) validate that the PCI/MSI-X domain supports dynamic MSI-X
> >      allocation
> > 
> >   D) if #A to #C are true, it enables the PCI device parent domain
> > 
> > I made #B a prerequisite for now, as that's an orthogonal problem, which
> > does not need to be solved upfront. Maybe it does not need to be solved
> > at all because the underlying PCI driver always allocates a management
> > interrupt before dealing with the underlying "bus", which is IMHO a
> > reasonable expectation. At least it's a reasonable restriction for
> > getting started.
> > 
> > That function does:
> > 
> >      msix_dom = pci_msi_get_msix_domain(pdev);
> >      msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
> >      msix_dom->msi_parent_ops = &pci_msix_parent_ops;
> > 
> > When creating the CXL devices the CXL code invokes
> > pci_msi_init_subdevice(pdev, &cxldev->device), which just does:
> > 
> >   dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev));
> > 
> > That allows the CXL devices to create their own per device MSI
> > domain via a new function pci_msi_create_subdevice_irq_domain().
> > 
> > That function can just use a variant of pci_create_device_domain() with
> > a different domain template and a different irqchip, where the irqchip
> > just redirects to the underlying parent domain chip, aka PCI-MSI-X.
> > 
> > I don't think the CXL device will require a special chip as all they
> > should need to know is the linux interrupt number. If there are special
> > requirements then we can bring the IMS code back or do something similar
> > to the platform MSI code. 
> > 
> > Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free()
> > which are the only two functions which the CXL (or similar) need.
> > 
> > The existing pci_msi*() API function might need a safety check so they
> > can't be abused by subdevices, but that's a problem for a final
> > solution.
> > 
> > That's pretty much all what it takes. Hope that helps.  
> 
> Absolutely!  Thanks for providing the detailed suggestion
> this definitely smells a lot less nasty than previous approach.
> 
> I have things sort of working now but it's ugly code with a few
> cross layer hacks that need tidying up (around programming the
> msi registers from wrong 'device'), so may be a little
> while before I get it in a state to post.

Hi Thomas,

My first solution had some callbacks where we created a local
descriptor so that I could swap in the pci_dev->dev and
just use the various standard pci/msi/irqdomain.c functions.

The 'minimal' solution seems to be a bit ugly.
There are quite a few places that make the assumption that the
irq_data->parent_data->chip is the right chip on which to, for example,
call irq_set_affinity().

So the simple way to make it all work is to just have
a msi_domain_template->msi_domain_ops->prepare_desc
that sets the desc->dev to the parent device (here the
pci_dev->dev).

At that point everything more or less just works and all the
rest of the callbacks can use generic forms.

An alternative might be to make calls like the one in
arch/x86/kernel/apic/msi.c:msi_set_affinity() search
for the first ancestor device that has an irq_set_affinity().
That unfortunately may affect quite a few places.

Anyhow, I'll probably send out the prepare_desc hack set with driver
usage etc. after I've written up a proper description of the problems
encountered, so you can see what it all looks like and it will be more
palatable in general. In the meantime, below is the guts of the variant
where the subdev related desc has the dev set to the parent device.

Note for the avoidance of doubt, I don't much like this
solution but maybe it will grow on me ;)

Jonathan



From eb5761b1cb7b6278c86c836ae552982621c3504e Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Tue, 10 Sep 2024 17:10:09 +0100
Subject: [PATCH 1/1] pci: msi: Add subdevice irq domains for dynamic MSI-X

PoC code only. All sorts of missing checks etc.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 arch/x86/kernel/apic/msi.c  |  3 ++
 drivers/pci/msi/api.c       | 14 ++++++
 drivers/pci/msi/irqdomain.c | 87 +++++++++++++++++++++++++++++++++++++
 include/linux/msi.h         |  5 +++
 include/linux/pci.h         |  3 ++
 kernel/irq/msi.c            |  5 +++
 6 files changed, 117 insertions(+)

diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 340769242dea..8ab4391d4559 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -218,6 +218,9 @@ static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
 	case DOMAIN_BUS_DMAR:
 	case DOMAIN_BUS_AMDVI:
 		break;
+	case DOMAIN_BUS_PCI_DEVICE_MSIX:
+		/* Currently needed just for the PCI MSI-X subdevice handling */
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		return false;
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index b956ce591f96..2b4c15102671 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -179,6 +179,20 @@ void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map)
 }
 EXPORT_SYMBOL_GPL(pci_msix_free_irq);
 
+struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index,
+					const struct irq_affinity_desc *affdesc)
+{
+	//missing sanity checks
+	return msi_domain_alloc_irq_at(dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_subdev_msix_alloc_irq_at);
+
+void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map)
+{
+	msi_domain_free_irqs_range(dev, MSI_DEFAULT_DOMAIN, map.index, map.index);
+}
+EXPORT_SYMBOL_GPL(pci_subdev_msix_free_irq);
+
 /**
  * pci_disable_msix() - Disable MSI-X interrupt mode on device
  * @dev: the PCI device to operate on
diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 569125726b3e..48357a8054ff 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -444,3 +444,90 @@ struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
 					     DOMAIN_BUS_PCI_MSI);
 	return dom;
 }
+
+static const struct msi_parent_ops pci_msix_parent_ops = {
+	.supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN,
+	.prefix = "PCI-MSIX-PROXY-",
+	.init_dev_msi_info = msi_parent_init_dev_msi_info,
+};
+
+int pci_msi_enable_parent_domain(struct pci_dev *pdev)
+{
+	struct irq_domain *msix_dom;
+	/* TGLX: Validate has v2 parent domain */
+	/* TGLX: validate msix enabled */
+	/* TGLX: Validate msix domain supports dynamics msi-x */
+
+	/* Enable PCI device parent domain */
+	msix_dom = dev_msi_get_msix_device_domain(&pdev->dev);
+	msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
+	msix_dom->msi_parent_ops = &pci_msix_parent_ops;
+	return 0;
+}
+
+void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev)
+{
+	dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev));
+}
+
+static bool pci_subdev_create_device_domain(struct device *dev, const struct msi_domain_template *tmpl,
+				     unsigned int hwsize)
+{
+	struct irq_domain *domain = dev_get_msi_domain(dev);
+
+	if (!domain || !irq_domain_is_msi_parent(domain))
+		return true;
+
+	return msi_create_device_irq_domain(dev, MSI_DEFAULT_DOMAIN, tmpl,
+					    hwsize, NULL, NULL);
+}
+
+static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
+					struct msi_desc *desc)
+{
+	struct device *parent = desc->dev->parent;
+
+	if (!desc->pci.mask_base) {
+		/* Not elegant - but needed for irq affinity to work? */
+		desc->dev = parent;
+		msix_prepare_msi_desc(to_pci_dev(parent), desc);
+	}
+}
+
+static const struct msi_domain_template subdev_template = {
+	.chip = {
+		.name = "SUBDEV",
+		.irq_mask = irq_chip_mask_parent,
+		.irq_unmask = irq_chip_unmask_parent,
+		.irq_write_msi_msg = pci_msi_domain_write_msg,
+		.irq_set_affinity = irq_chip_set_affinity_parent,
+		.flags = IRQCHIP_ONESHOT_SAFE,
+	},
+	.ops = {
+		/*
+		 * RFC: Sets the desc->dev to the parent PCI device
+		 *       Needed for
+		 *       irq_setup_affinity() ->
+		 *          msi_set_affinity() ->
+		 *             parent = irq_d->parent_data;
+		 *             parent->chip->irq_set_affinity() to work.
+		 *      That could be made more flexible perhaps as
+		 *      currently it makes assumption that parent of
+		 *      the MSI device is the one to set the affinity on.
+		 */
+		.prepare_desc = pci_subdev_msix_prepare_desc,
+		/* Works because the desc->dev is the parent PCI device */
+		.set_desc = pci_msi_domain_set_desc,
+	},
+	.info = {
+		.flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN |
+		MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_USE_DEF_DOM_OPS,
+		.bus_token = DOMAIN_BUS_PCI_DEVICE_MSIX,
+	},
+};
+
+bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize)
+{
+	return pci_subdev_create_device_domain(dev, &subdev_template, hwsize);
+}
+EXPORT_SYMBOL_GPL(pci_subdev_setup_device_domain);
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 944979763825..ff81b4dcc1d9 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -656,8 +656,13 @@ void pci_msi_unmask_irq(struct irq_data *data);
 struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode,
 					     struct msi_domain_info *info,
 					     struct irq_domain *parent);
+int pci_msi_enable_parent_domain(struct pci_dev *pdev);
+struct irq_domain *pci_msi_get_msix_device_domain(struct device *dev);
 u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
 struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
+struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev);
+void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev);
+bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize);
 #else /* CONFIG_PCI_MSI */
 static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
 {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 225de9be3006..460551f1bd6e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1679,6 +1679,9 @@ struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
 				     const struct irq_affinity_desc *affdesc);
 void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map);
 
+struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index,
+					const struct irq_affinity_desc *affdesc);
+void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map);
 void pci_free_irq_vectors(struct pci_dev *dev);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
 const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 5fa0547ece0c..d55a91c7a496 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev)
 	return arch_is_isolated_msi();
 }
 EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi);
+struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev)
+{
+	return dev->msi.data->__domains[0].domain;
+}
+
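
For completeness, the consumer side would end up looking something like the
below. Sketch only - cpmu_setup_irq(), the hwsize handling and the irq name
are illustrative and not part of the patch; it assumes the parent portdrv
already called pci_msi_enable_parent_domain() and pci_msi_init_subdevice()
when it created the auxiliary device, and it glosses over how the CPMU's
interrupt message number maps to an index.

static int cpmu_setup_irq(struct device *dev, unsigned int hwsize,
			  irq_handler_t handler, void *data)
{
	struct msi_map map;

	/* Per-subdevice MSI domain on top of the parent's MSI-X domain */
	if (!pci_subdev_setup_device_domain(dev, hwsize))
		return -ENODEV;

	map = pci_subdev_msix_alloc_irq_at(dev, MSI_ANY_INDEX, NULL);
	if (map.index < 0)
		return map.index;

	return devm_request_irq(dev, map.virq, handler, 0, "cxl-cpmu", data);
}

Teardown would be pci_subdev_msix_free_irq() with the returned map.
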
Jonathan Cameron Sept. 10, 2024, 5:37 p.m. UTC | #14
On Tue, 10 Sep 2024 17:47:43 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Fri, 6 Sep 2024 18:18:32 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
> > On Fri, 06 Sep 2024 12:11:36 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >   
> > > On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote:    
> > > >> Look at how the PCI device manages interrupts with the per device MSI
> > > >> mechanism. Forget doing this with either one of the legacy mechanisms.
> > > >> 
> > > >>   1) It creates a hierarchical interrupt domain and gets the required
> > > >>      resources from the provided parent domain. The PCI side does not
> > > >>      care whether this is x86 or arm64 and it neither cares whether the
> > > >>      parent domain does remapping or not. The only way it cares is about
> > > >>      the features supported by the different parent domains (think
> > > >>      multi-MSI).
> > > >>      
> > > >>   2) Driver side allocations go through the per device domain
> > > >> 
> > > >> That AUXbus is not any different. When the CPMU driver binds it wants to
> > > >> allocate interrupts. So instead of having a convoluted construct
> > > >> reaching into the parent PCI device, you simply can do:
> > > >> 
> > > >>   1) Let the cxl_pci driver create a MSI parent domain and set that in
> > > >>      the subdevice::msi::irqdomain pointer.
> > > >> 
> > > >>   2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device
> > > >>      driver to create a per device interrupt domain.
> > > >> 
> > > >>   3) Let the CPMU driver create its per device interrupt domain with
> > > >>      the provided parent domain
> > > >> 
> > > >>   4) Let the CPMU driver allocate its MSI-X interrupts through the per
> > > >>      device domain
> > > >> 
> > > >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation
> > > >> function does:
> > > >> 
> > > >>     if (!pci_msix_enabled(pdev))
> > > >>     	return pci_msix_enable_and_alloc_irq_at(pdev, ....);
> > > >>     else
> > > >>         return pci_msix_alloc_irq_at(pdev, ....);      
> > > 
> > > Ignore the above. Brainfart.
> > >     
> > > > I'm struggling to follow this suggestion
> > > > Would you expect the cxl_pci MSI parent domain to set it's parent as
> > > > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev),
> > > > 					 IRQ_DOMAIN_FLAG_MSI_PARENT,
> > > > 					 ...
> > > > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs()
> > > > or create a separate domain structure and provide callbacks that reach into
> > > > the parent domain as necessary?
> > > >
> > > > Or do I have this entirely wrong? I'm struggling to relate what existing
> > > > code like PCI does to what I need to do here.      
> > > 
> > > dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different
> > > models we have:
> > > 
> > >    - Legacy has no domain
> > > 
> > >    - Non-hierarchical domain
> > > 
> > >    - Hierarchical domain v1
> > > 
> > >          That's the global PCI/MSI domain
> > > 
> > >    - Hierarchical domain v2
> > > 
> > >       That's the underlying MSI_PARENT domain, which is on x86
> > >       either the interrupt remap unit or the vector domain. On arm64
> > >       it's the ITS domain.
> > > 
> > > See e.g. pci_msi_domain_supports() which handles the mess.
> > > 
> > > Now this proposal will only work on hierarchical domain v2 because that
> > > can do the post enable allocations on MSI-X. Let's put a potential
> > > solution for MSI aside for now to avoid complexity explosion.
> > > 
> > > So there are two ways to solve this:
> > > 
> > >   1) Implement the CXL MSI parent domain as disjunct domain
> > > 
> > >   2) Teach the V2 per device MSI-X domain to be a parent domain
> > > 
> > > #1 Looks pretty straight forward, but does not seemlessly fit the
> > >    hierarchical domain model and creates horrible indirections.
> > > 
> > > #2 Is the proper abstraction, but requires to teach the MSI core code
> > >    about stacked MSI parent domains, which should not be horribly hard
> > >    or maybe just works out of the box.
> > > 
> > > The PCI device driver invokes the not yet existing function
> > > pci_enable_msi_parent_domain(pci_dev). This function does:
> > > 
> > >   A) validate that this PCI device has a V2 parent domain
> > > 
> > >   B) validate that the device has enabled MSI-X
> > > 
> > >   C) validate that the PCI/MSI-X domain supports dynamic MSI-X
> > >      allocation
> > > 
> > >   D) if #A to #C are true, it enables the PCI device parent domain
> > > 
> > > I made #B a prerequisite for now, as that's an orthogonal problem, which
> > > does not need to be solved upfront. Maybe it does not need to be solved
> > > at all because the underlying PCI driver always allocates a management
> > > interrupt before dealing with the underlying "bus", which is IMHO a
> > > reasonable expectation. At least it's a reasonable restriction for
> > > getting started.
> > > 
> > > That function does:
> > > 
> > >      msix_dom = pci_msi_get_msix_domain(pdev);
> > >      msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
> > >      msix_dom->msi_parent_ops = &pci_msix_parent_ops;
> > > 
> > > When creating the CXL devices the CXL code invokes
> > > pci_msi_init_subdevice(pdev, &cxldev->device), which just does:
> > > 
> > >   dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev));
> > > 
> > > That allows the CXL devices to create their own per device MSI
> > > domain via a new function pci_msi_create_subdevice_irq_domain().
> > > 
> > > That function can just use a variant of pci_create_device_domain() with
> > > a different domain template and a different irqchip, where the irqchip
> > > just redirects to the underlying parent domain chip, aka PCI-MSI-X.
> > > 
> > > I don't think the CXL device will require a special chip as all they
> > > should need to know is the linux interrupt number. If there are special
> > > requirements then we can bring the IMS code back or do something similar
> > > to the platform MSI code. 
> > > 
> > > Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free()
> > > which are the only two functions which the CXL (or similar) need.
> > > 
> > > The existing pci_msi*() API function might need a safety check so they
> > > can't be abused by subdevices, but that's a problem for a final
> > > solution.
> > > 
> > > That's pretty much all what it takes. Hope that helps.    
> > 
> > Absolutely!  Thanks for providing the detailed suggestion
> > this definitely smells a lot less nasty than previous approach.
> > 
> > I have things sort of working now but it's ugly code with a few
> > cross layer hacks that need tidying up (around programming the
> > msi registers from wrong 'device'), so may be a little
> > while before I get it in a state to post.  
> 
> Hi Thomas,

I forgot to say, for anyone reading this, don't spend too much time on it yet!
The whole topic area of refactoring portdrv is up for discussion
in the PCI-related uconf at Plumbers next week.

Whilst I quite like this approach, we might end up going in a totally
different direction due to legacy and other concerns :(

Jonathan

> 
> My first solution had some callbacks where we created a local
> descriptor so that I could swap in the pci_dev->dev and
> just use the various standard pci/msi/irqdomain.c functions.
> 
> The 'minimal' solution seems to be a bit ugly.
> `
> There are quite a few places that make the assumption that the
> preirq_data->parent->chip is the right chip to for example call
> irq_set_affinity on.
> 
> So the simple way to make it all work is to just have
> a msi_domain_template->msi_domain_ops->prepare_desc
> that sets the desc->dev = to the parent device (here the
> pci_dev->dev)
> 
> At that point everything more or less just works and all the
> rest of the callbacks can use generic forms.
> 
> Alternative might be to make calls like the one in
> arch/x86/kernel/apic/msi.c msi_set_affinity search
> for the first ancestor device that has an irq_set_affinity().
> That unfortunately may affect quite a few places.
> 
> Anyhow, I'll probably send out the prepare_desc hack set with
> driver usage etc after I've written up a proper description of problems encountered
> etc so you can see what it all looks like and will be more palatable in
> general but in the  meantime this is the guts of it of the variant where the
> subdev related desc has the dev set to the parent device.
> 
> Note for the avoidance of doubt, I don't much like this
> solution but maybe it will grow on me ;)
> 
> Jonathan
> 
> 
> 
> From eb5761b1cb7b6278c86c836ae552982621c3504e Mon Sep 17 00:00:00 2001
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Date: Tue, 10 Sep 2024 17:10:09 +0100
> Subject: [PATCH 1/1] pci: msi: Add subdevice irq domains for dynamic MSI-X
> 
> PoC code only. All sorts of missing checks etc.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  arch/x86/kernel/apic/msi.c  |  3 ++
>  drivers/pci/msi/api.c       | 14 ++++++
>  drivers/pci/msi/irqdomain.c | 87 +++++++++++++++++++++++++++++++++++++
>  include/linux/msi.h         |  5 +++
>  include/linux/pci.h         |  3 ++
>  kernel/irq/msi.c            |  5 +++
>  6 files changed, 117 insertions(+)
> 
> diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> index 340769242dea..8ab4391d4559 100644
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -218,6 +218,9 @@ static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
>  	case DOMAIN_BUS_DMAR:
>  	case DOMAIN_BUS_AMDVI:
>  		break;
> +	case DOMAIN_BUS_PCI_DEVICE_MSIX:
> +		/* Currently needed just for the PCI MSI-X subdevice handling */
> +		break;
>  	default:
>  		WARN_ON_ONCE(1);
>  		return false;
> diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
> index b956ce591f96..2b4c15102671 100644
> --- a/drivers/pci/msi/api.c
> +++ b/drivers/pci/msi/api.c
> @@ -179,6 +179,20 @@ void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map)
>  }
>  EXPORT_SYMBOL_GPL(pci_msix_free_irq);
>  
> +struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index,
> +					const struct irq_affinity_desc *affdesc)
> +{
> +	//missing sanity checks
> +	return msi_domain_alloc_irq_at(dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL);
> +}
> +EXPORT_SYMBOL_GPL(pci_subdev_msix_alloc_irq_at);
> +
> +void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map)
> +{
> +	msi_domain_free_irqs_range(dev, MSI_DEFAULT_DOMAIN, map.index, map.index);
> +}
> +EXPORT_SYMBOL_GPL(pci_subdev_msix_free_irq);
> +
>  /**
>   * pci_disable_msix() - Disable MSI-X interrupt mode on device
>   * @dev: the PCI device to operate on
> diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
> index 569125726b3e..48357a8054ff 100644
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -444,3 +444,90 @@ struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
>  					     DOMAIN_BUS_PCI_MSI);
>  	return dom;
>  }
> +
> +static const struct msi_parent_ops pci_msix_parent_ops = {
> +	.supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN,
> +	.prefix = "PCI-MSIX-PROXY-",
> +	.init_dev_msi_info = msi_parent_init_dev_msi_info,
> +};
> +
> +int pci_msi_enable_parent_domain(struct pci_dev *pdev)
> +{
> +	struct irq_domain *msix_dom;
> +	/* TGLX: Validate has v2 parent domain */
> +	/* TGLX: validate msix enabled */
> +	/* TGLX: Validate msix domain supports dynamics msi-x */
> +
> +	/* Enable PCI device parent domain */
> +	msix_dom = dev_msi_get_msix_device_domain(&pdev->dev);
> +	msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
> +	msix_dom->msi_parent_ops = &pci_msix_parent_ops;
> +	return 0;
> +}
> +
> +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev)
> +{
> +	dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev));
> +}
> +
> +static bool pci_subdev_create_device_domain(struct device *dev, const struct msi_domain_template *tmpl,
> +				     unsigned int hwsize)
> +{
> +	struct irq_domain *domain = dev_get_msi_domain(dev);
> +
> +	if (!domain || !irq_domain_is_msi_parent(domain))
> +		return true;
> +
> +	return msi_create_device_irq_domain(dev, MSI_DEFAULT_DOMAIN, tmpl,
> +					    hwsize, NULL, NULL);
> +}
> +
> +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> +					struct msi_desc *desc)
> +{
> +	struct device *parent = desc->dev->parent;
> +
> +	if (!desc->pci.mask_base) {
> +		/* Not elegant - but needed for irq affinity to work? */
> +		desc->dev = parent;
> +		msix_prepare_msi_desc(to_pci_dev(parent), desc);
> +	}
> +}
> +
> +static const struct msi_domain_template subdev_template = {
> +	.chip = {
> +		.name = "SUBDEV",
> +		.irq_mask = irq_chip_mask_parent,
> +		.irq_unmask = irq_chip_unmask_parent,
> +		.irq_write_msi_msg = pci_msi_domain_write_msg,
> +		.irq_set_affinity = irq_chip_set_affinity_parent,
> +		.flags = IRQCHIP_ONESHOT_SAFE,
> +	},
> +	.ops = {
> +		/*
> +		 * RFC: Sets the desc->dev to the parent PCI device
> +		 *       Needed for
> +		 *       irq_setup_affinity() ->
> +		 *          msi_set_affinity() ->
> +		 *             parent = irq_d->parent_data;
> +		 *             parent->chip->irq_set_affinity() to work.
> +		 *      That could be made more flexible perhaps as
> +		 *      currently it makes assumption that parent of
> +		 *      the MSI device is the one to set the affinity on.
> +		 */
> +		.prepare_desc = pci_subdev_msix_prepare_desc,
> +		/* Works because the desc->dev is the parent PCI device */
> +		.set_desc = pci_msi_domain_set_desc,
> +	},
> +	.info = {
> +		.flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN |
> +		MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_USE_DEF_DOM_OPS,
> +		.bus_token = DOMAIN_BUS_PCI_DEVICE_MSIX,
> +	},
> +};
> +
> +bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize)
> +{
> +	return pci_subdev_create_device_domain(dev, &subdev_template, hwsize);
> +}
> +EXPORT_SYMBOL_GPL(pci_subdev_setup_device_domain);
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 944979763825..ff81b4dcc1d9 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -656,8 +656,13 @@ void pci_msi_unmask_irq(struct irq_data *data);
>  struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode,
>  					     struct msi_domain_info *info,
>  					     struct irq_domain *parent);
> +int pci_msi_enable_parent_domain(struct pci_dev *pdev);
> +struct irq_domain *pci_msi_get_msix_device_domain(struct device *dev);
>  u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
>  struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
> +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev);
> +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev);
> +bool pci_subdev_setup_device_domain(struct device *dev, unsigned int hwsize);
>  #else /* CONFIG_PCI_MSI */
>  static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
>  {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 225de9be3006..460551f1bd6e 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1679,6 +1679,9 @@ struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
>  				     const struct irq_affinity_desc *affdesc);
>  void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map);
>  
> +struct msi_map pci_subdev_msix_alloc_irq_at(struct device *dev, unsigned int index,
> +					const struct irq_affinity_desc *affdesc);
> +void pci_subdev_msix_free_irq(struct device *dev, struct msi_map map);
>  void pci_free_irq_vectors(struct pci_dev *dev);
>  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
>  const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index 5fa0547ece0c..d55a91c7a496 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev)
>  	return arch_is_isolated_msi();
>  }
>  EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi);
> +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev)
> +{
> +	return dev->msi.data->__domains[0].domain;
> +}
> +
Thomas Gleixner Sept. 10, 2024, 8:04 p.m. UTC | #15
On Tue, Sep 10 2024 at 17:47, Jonathan Cameron wrote:
> There are quite a few places that make the assumption that the
> preirq_data->parent->chip is the right chip to for example call
> irq_set_affinity on.
>
> So the simple way to make it all work is to just have
> a msi_domain_template->msi_domain_ops->prepare_desc
> that sets the desc->dev = to the parent device (here the
> pci_dev->dev)

Creative :)

> At that point everything more or less just works and all the
> rest of the callbacks can use generic forms.
>
> Alternative might be to make calls like the one in
> arch/x86/kernel/apic/msi.c msi_set_affinity search
> for the first ancestor device that has an irq_set_affinity().
> That unfortunately may affect quite a few places.

What's the problem? None of the CXL drivers should care about that at
all. Delegating other callbacks to the parent domain is what hierarchical
domains are about. If there is some helper missing, so we add it.

> Anyhow, I'll probably send out the prepare_desc hack set with driver
> usage etc after I've written up a proper description of problems
> encountered etc so you can see what it all looks like and will be more
> palatable in general but in the meantime this is the guts of it of the
> variant where the subdev related desc has the dev set to the parent
> device.

Let me look at this pile then :)

> Note for the avoidance of doubt, I don't much like this
> solution but maybe it will grow on me ;)

Maybe, but I doubt it will survive my abstraction taste check.

> +static const struct msi_parent_ops pci_msix_parent_ops = {
> +	.supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN,
> +	.prefix = "PCI-MSIX-PROXY-",
> +	.init_dev_msi_info = msi_parent_init_dev_msi_info,

The obligatory link to:

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers

> +};
> +
> +int pci_msi_enable_parent_domain(struct pci_dev *pdev)
> +{
> +	struct irq_domain *msix_dom;
> +	/* TGLX: Validate has v2 parent domain */
> +	/* TGLX: validate msix enabled */
> +	/* TGLX: Validate msix domain supports dynamics msi-x */
> +
> +	/* Enable PCI device parent domain */
> +	msix_dom = dev_msi_get_msix_device_domain(&pdev->dev);
> +	msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
> +	msix_dom->msi_parent_ops = &pci_msix_parent_ops;

That wants to be a call into the MSI core code. Something like:

     msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX)

or something like that. I really don't want to expose the internals of
MSI. I spent a lot of effort to encapsulate that...
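
Roughly (sketch only - the name and exact signature are open, and whether
the parent ops are passed in or derived from the bus token is one of the
open details; the point is that the __domains access stays inside
kernel/irq/msi.c):

/* kernel/irq/msi.c - sketch only */
int msi_enable_sub_parent_domain(struct device *dev, unsigned int domid,
				 const struct msi_parent_ops *ops)
{
	struct irq_domain *domain;
	int ret = -ENODEV;

	if (!dev->msi.data)
		return ret;

	msi_lock_descs(dev);
	domain = dev->msi.data->__domains[domid].domain;
	if (domain) {
		domain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
		domain->msi_parent_ops = ops;
		ret = 0;
	}
	msi_unlock_descs(dev);
	return ret;
}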

> +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev)
> +{
> +	dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev));

     msi_set_parent_domain(.....);

> +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> +					struct msi_desc *desc)
> +{
> +	struct device *parent = desc->dev->parent;
> +
> +	if (!desc->pci.mask_base) {
> +		/* Not elegant - but needed for irq affinity to work? */

Which makes me ask the question why desc->pci.mask_base is NULL in the
first place. That needs some thought at the place which initializes the
descriptor.

> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev)
>  	return arch_is_isolated_msi();
>  }
>  EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi);
> +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev)

You clearly run out of new lines here :)

> +{
> +	return dev->msi.data->__domains[0].domain;
> +}

But aside of that. No. See above.

Thanks,

        tglx
Jonathan Cameron Sept. 12, 2024, 4:37 p.m. UTC | #16
On Tue, 10 Sep 2024 22:04:18 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, Sep 10 2024 at 17:47, Jonathan Cameron wrote:
> > There are quite a few places that make the assumption that the
> > preirq_data->parent->chip is the right chip to for example call
> > irq_set_affinity on.
> >
> > So the simple way to make it all work is to just have
> > a msi_domain_template->msi_domain_ops->prepare_desc
> > that sets the desc->dev = to the parent device (here the
> > pci_dev->dev)  
> 
> Creative :)

One bit that is challenging me is that the PCI access functions
use the pci_dev that is reached via irq_data->common->msi_desc->dev
(and hence doesn't give us an obvious path to any hierarchical structures),
e.g. __pci_write_msi_msg(), which also checks that the pci_dev is
powered up.

I'd like to be able to do a call to the parent similar to irq_chip_unmask_parent
to handle write_msi_msg() for the new device domain.

So how should this be avoided or should msi_desc->dev be the
PCI device?

I can see some options but maybe I'm missing the big picture.
1) Stop them using that path to get to the device, because it
  might not be the right one. Instead reach the device via irq_data->chip_data
  (which is currently unused for these cases) - see the sketch after this list.
  Some slightly fiddly work will be needed to change __pci_write_msi_msg()
  to take irq_data though.
2) Maybe setting the msi descriptor dev to the one we actually write
   on is the right solution? (smells bad to me but I'm still getting
   my head around this stuff).
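
For option 1 the rough shape would be something like this (sketch only;
subdev_irq_data_to_pci_dev() is an illustrative name, and it assumes the
subdev domain stashes the parent pci_dev in chip_data at allocation time,
which the current PoC does not do):

static struct pci_dev *subdev_irq_data_to_pci_dev(struct irq_data *d)
{
	/* chip_data would hold the parent pci_dev for these descriptors */
	return irq_data_get_irq_chip_data(d);
}

__pci_write_msi_msg() (or an irq_data based variant of it) would then use
that rather than going via msi_desc->dev.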

Any immediate thoughts?  Or am I still thinking about this the wrong
way? (which is likely)

> 
> > At that point everything more or less just works and all the
> > rest of the callbacks can use generic forms.
> >
> > Alternative might be to make calls like the one in
> > arch/x86/kernel/apic/msi.c msi_set_affinity search
> > for the first ancestor device that has an irq_set_affinity().
> > That unfortunately may affect quite a few places.  
> 
> What's the problem? None of the CXL drivers should care about that at
> all. Delegating other callbacks to the parent domain is what hierarchical
> domains are about. If there is some helper missing, so we add it.
> 
> > Anyhow, I'll probably send out the prepare_desc hack set with driver
> > usage etc after I've written up a proper description of problems
> > encountered etc so you can see what it all looks like and will be more
> > palatable in general but in the meantime this is the guts of it of the
> > variant where the subdev related desc has the dev set to the parent
> > device.  
> 
> Let me look at this pile then :)
> 
> > Note for the avoidance of doubt, I don't much like this
> > solution but maybe it will grow on me ;)  
> 
> Maybe, but I doubt it will survive my abstraction taste check.
> 
> > +static const struct msi_parent_ops pci_msix_parent_ops = {
> > +	.supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN,
> > +	.prefix = "PCI-MSIX-PROXY-",
> > +	.init_dev_msi_info = msi_parent_init_dev_msi_info,  
> 
> The obligatory link to:
> 
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers
> 

Thanks. Will fix up before I get anywhere near a real posting!

> > +};
> > +
> > +int pci_msi_enable_parent_domain(struct pci_dev *pdev)
> > +{
> > +	struct irq_domain *msix_dom;
> > +	/* TGLX: Validate has v2 parent domain */
> > +	/* TGLX: validate msix enabled */
> > +	/* TGLX: Validate msix domain supports dynamics msi-x */
> > +
> > +	/* Enable PCI device parent domain */
> > +	msix_dom = dev_msi_get_msix_device_domain(&pdev->dev);
> > +	msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
> > +	msix_dom->msi_parent_ops = &pci_msix_parent_ops;  
> 
> That want's to be a call into the MSI core code, Something like:
> 
>      msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX)
> 
> or something like that. I really don't want to expose the internals of
> MSI. I spent a lot of effort to encapsulate that...
> 
> > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev)
> > +{
> > +	dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev));  
> 
>      msi_set_parent_domain(.....);
> 
> > +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> > +					struct msi_desc *desc)
> > +{
> > +	struct device *parent = desc->dev->parent;
> > +
> > +	if (!desc->pci.mask_base) {
> > +		/* Not elegant - but needed for irq affinity to work? */  
> 
> Which makes me ask the question why desc->pci.mask_base is NULL in the
> first place. That needs some thought at the place which initializes the
> descriptor.

Maybe this is the other way around?  The descriptor is 'never' initialized
for this case.  The equivalent code in msix_prepare_msi_desc() has the
comment
" This is separate from msix_setup_msi_descs() below to handle dynamic
 allocations for MSI-X after initial enablement."
So I think this !desc->pci.mask_base check should always be true
(and so I should get rid of it and run the descriptor initialization
unconditionally).
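
i.e. (sketch only) the callback would reduce to something like:

static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
					 struct msi_desc *desc)
{
	/* Never pre-initialised for the subdev case, so no mask_base check */
	desc->dev = desc->dev->parent;
	msix_prepare_msi_desc(to_pci_dev(desc->dev), desc);
}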

It might be appropriate to call the prepare_desc from the parent msi_domain though,
via irq_domain->parent->host_data.
However this has the same need to not be based on msi_desc->dev, subject
to the question above.

Jonathan



> 
> > --- a/kernel/irq/msi.c
> > +++ b/kernel/irq/msi.c
> > @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev)
> >  	return arch_is_isolated_msi();
> >  }
> >  EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi);
> > +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev)  
> 
> You clearly run out of new lines here :)
> 
> > +{
> > +	return dev->msi.data->__domains[0].domain;
> > +}  
> 
> But aside of that. No. See above.
> 
> Thanks,
> 
>         tglx
>
Jonathan Cameron Sept. 12, 2024, 5:34 p.m. UTC | #17
On Thu, 12 Sep 2024 17:37:20 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Tue, 10 Sep 2024 22:04:18 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Tue, Sep 10 2024 at 17:47, Jonathan Cameron wrote:
> > > There are quite a few places that make the assumption that the
> > > preirq_data->parent->chip is the right chip to for example call
> > > irq_set_affinity on.
> > >
> > > So the simple way to make it all work is to just have
> > > a msi_domain_template->msi_domain_ops->prepare_desc
> > > that sets the desc->dev = to the parent device (here the
> > > pci_dev->dev)  
> > 
> > Creative :)
> 
> One bit that is challenging me is that the PCI access functions
> use the pci_dev that is embedded via irq_data->common->msi_desc->dev
> (and hence doesn't give us a obvious path to any hierarchical structures)
> E.g. __pci_write_msi_msg() and which checks the pci_dev is
> powered up.
> 
> I'd like to be able to do a call to the parent similar to irq_chip_unmask_parent
> to handle write_msi_msg() for the new device domain.
> 
> So how should this be avoided or should msi_desc->dev be the
> PCI device?

I thought about this whilst cycling home.  Perhaps my fundamental
question is more basic: which device should
msi_desc->dev be?
* The ultimate requester of the MSI - so here the one calling
  our new pci_subdev_msix_alloc_irq_at(), or
* The one on which the write_msi_msg() should occur - here
  that would be the struct pci_dev, given the checks needed on
  whether the device is powered up.  If this is the case, can
  we instead use the irq_data->dev in __pci_write_msi_msg()?

Also, is it valid to use irq_domain->dev for callbacks such
as prepare_desc, on the basis that it is (I think) the device
that 'hosts' the irq domain?  If so, can we use it to replace the
use of msi_desc->dev in pci_msix_prepare_desc()?
If we can, that enables our subdev to use a prepare_desc callback
that does something like:

	struct msi_domain_info *info;

	domain = domain->parent;
	info = domain->host_data;

	info->ops->prepare_desc(domain, arg, desc);

Thanks for any pointers!

Jonathan


> 
> I can see some options but maybe I'm missing the big picture.
> 1) Stop them using that path to get to the device because it
>   might not be the right one. Instead use device via irq_data->chip_data
>   (which is currently unused for these cases).
>   Some slightly fiddly work will be needed to handle changing __pci_write_msi_msg()
>   to talk irq_data though.
> 2) Maybe setting the msi descriptor dev to the one we actually write
>    on is the right solution? (smells bad to me but I'm still getting
>    my head around this stuff).
> 
> Any immediate thoughts?  Or am I still thinking about this the wrong
> way? (which is likely)
> 
> > 
> > > At that point everything more or less just works and all the
> > > rest of the callbacks can use generic forms.
> > >
> > > Alternative might be to make calls like the one in
> > > arch/x86/kernel/apic/msi.c msi_set_affinity search
> > > for the first ancestor device that has an irq_set_affinity().
> > > That unfortunately may affect quite a few places.  
> > 
> > What's the problem? None of the CXL drivers should care about that at
> > all. Delegating other callbacks to the parent domain is what hierarchical
> > domains are about. If there is some helper missing, so we add it.
> > 
> > > Anyhow, I'll probably send out the prepare_desc hack set with driver
> > > usage etc after I've written up a proper description of problems
> > > encountered etc so you can see what it all looks like and will be more
> > > palatable in general but in the meantime this is the guts of it of the
> > > variant where the subdev related desc has the dev set to the parent
> > > device.  
> > 
> > Let me look at this pile then :)
> > 
> > > Note for the avoidance of doubt, I don't much like this
> > > solution but maybe it will grow on me ;)  
> > 
> > Maybe, but I doubt it will survive my abstraction taste check.
> > 
> > > +static const struct msi_parent_ops pci_msix_parent_ops = {
> > > +	.supported_flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN,
> > > +	.prefix = "PCI-MSIX-PROXY-",
> > > +	.init_dev_msi_info = msi_parent_init_dev_msi_info,  
> > 
> > The obligatory link to:
> > 
> > https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#struct-declarations-and-initializers
> > 
> 
> Thanks. Will fix up before I get anywhere near a real posting!
> 
> > > +};
> > > +
> > > +int pci_msi_enable_parent_domain(struct pci_dev *pdev)
> > > +{
> > > +	struct irq_domain *msix_dom;
> > > +	/* TGLX: Validate has v2 parent domain */
> > > +	/* TGLX: validate msix enabled */
> > > +	/* TGLX: Validate msix domain supports dynamics msi-x */
> > > +
> > > +	/* Enable PCI device parent domain */
> > > +	msix_dom = dev_msi_get_msix_device_domain(&pdev->dev);
> > > +	msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
> > > +	msix_dom->msi_parent_ops = &pci_msix_parent_ops;  
> > 
> > That want's to be a call into the MSI core code, Something like:
> > 
> >      msi_enable_sub_parent_domain(&pdev->dev, DOMAIN_BUS_PCI_DEVICE_MSIX)
> > 
> > or something like that. I really don't want to expose the internals of
> > MSI. I spent a lot of effort to encapsulate that...
> > 
> > > +void pci_msi_init_subdevice(struct pci_dev *pdev, struct device *dev)
> > > +{
> > > +	dev_set_msi_domain(dev, dev_msi_get_msix_device_domain(&pdev->dev));  
> > 
> >      msi_set_parent_domain(.....);
> > 
> > > +static void pci_subdev_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> > > +					struct msi_desc *desc)
> > > +{
> > > +	struct device *parent = desc->dev->parent;
> > > +
> > > +	if (!desc->pci.mask_base) {
> > > +		/* Not elegant - but needed for irq affinity to work? */  
> > 
> > Which makes me ask the question why desc->pci.mask_base is NULL in the
> > first place. That needs some thought at the place which initializes the
> > descriptor.
> 
> Maybe this is the other way around?  The descriptor is 'never' initialized
> for this case.  The equivalent code in msix_prepare_msi_desc() has
> comments 
> " This is separate from msix_setup_msi_descs() below to handle dynamic
>  allocations for MSI-X after initial enablement."
> So I think this !desc->pci.mask_base should always succeed 
> (and so I should get rid of it and run the descriptor initialize unconditionally)
> 
> It might be appropriate to call the prepare_desc from the parent msi_domain though
> via irq_domain->parent_domain->host_data
> However this has the same need to not be based on the msi_desc->dev subject
> to the question above.
> 
> Jonathan
> 
> 
> 
> > 
> > > --- a/kernel/irq/msi.c
> > > +++ b/kernel/irq/msi.c
> > > @@ -1720,3 +1720,8 @@ bool msi_device_has_isolated_msi(struct device *dev)
> > >  	return arch_is_isolated_msi();
> > >  }
> > >  EXPORT_SYMBOL_GPL(msi_device_has_isolated_msi);
> > > +struct irq_domain *dev_msi_get_msix_device_domain(struct device *dev)  
> > 
> > You clearly run out of new lines here :)
> > 
> > > +{
> > > +	return dev->msi.data->__domains[0].domain;
> > > +}  
> > 
> > But aside of that. No. See above.
> > 
> > Thanks,
> > 
> >         tglx
> > 
>
Thomas Gleixner Sept. 13, 2024, 4:24 p.m. UTC | #18
On Thu, Sep 12 2024 at 18:34, Jonathan Cameron wrote:
> On Thu, 12 Sep 2024 17:37:20 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>> One bit that is challenging me is that the PCI access functions
>> use the pci_dev that is embedded via irq_data->common->msi_desc->dev
>> (and hence doesn't give us a obvious path to any hierarchical structures)
>> E.g. __pci_write_msi_msg() and which checks the pci_dev is
>> powered up.
>>
>> I'd like to be able to do a call to the parent similar to
>> irq_chip_unmask_parent to handle write_msi_msg() for the new device
>> domain.

That's what hierarchy is about.  I see your problem vs. the
msi_desc::dev though.

>> So how should this be avoided or should msi_desc->dev be the
>> PCI device?

See below.

> I thought about this whilst cycling home.  Perhaps my fundamental
> question is more basic  Which device should
> msi_dec->dev be?
> * The ultimate requester of the msi  -  so here the one calling
>   our new pci_msix_subdev_alloc_at(),
> * The one on which the msi_write_msg() should occur. - here
>   that would be the struct pci_dev given the checks needed on
>   whether the device is powered up.  If this is the case, can
>   we instead use the irq_data->dev in __pci_write_msi_msg()?

Only for per device MSI domains and you need to look irq data up, so
leave that alone.

> Also, is it valid to use the irq_domain->dev for callbacks such
> as msi_prepare_desc on basis that (I think) is the device
> that 'hosts' the irq domain, so can we use it to replace the
> use of msi_desc->dev in pci_msix_prepare_desc()
> If we can that enables our subdev to use a prepare_desc callback
> that does something like
>
> 	struct msi_domain *info;
>
> 	domain = domain->parent;
> 	info = domain->host_data;
>
> 	info->ops->prepare_desc(domain, arg, desc);

That should work, but you can simply do:

static void subdev_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
                                struct msi_desc *desc)
{
        desc->dev = domain->parent->dev;
        pci_msix_prepare_desc(domain, arg, desc);
}

as this is a strict relationship in the PCI code.

This needs a validation that nothing relies on desc->dev being the same
as the device which allocated the descriptor:

  msi_desc_to_pci_dev(), msi_desc_to_dev() and a pile of open coded
  accesses.

The open coded access should be fixed and prevented by marking it
__private so people are forced to use the accessor functions.
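
Roughly along these lines (sketch, not a patch; ACCESS_PRIVATE() already
exists for exactly this purpose):

struct msi_desc {
	/* ... */
	struct device			*dev __private;
	/* ... */
};

static inline struct device *msi_desc_to_dev(struct msi_desc *desc)
{
	return ACCESS_PRIVATE(desc, dev);
}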

If there is no reliance, we can just do that.

I checked the MSI core code already. Nothing to see there.

Thanks,

        tglx