Message ID:   1573040408-3831-4-git-send-email-jonathan.derrick@intel.com
State:        Changes Requested
Delegated to: Lorenzo Pieralisi
Series:       PCI: vmd: Reducing tail latency by affining to the storage stack
On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> Using managed IRQ affinities sets up the VMD affinities identically to
> the child devices when those devices' vector counts are limited by VMD.
> This promotes better affinity handling as interrupts won't necessarily
> need to pass context between non-local CPUs. One pre-vector is reserved
> for the slow interrupt and not considered in the affinity algorithm.

This only works if all devices have exactly the same number of interrupts
as the parent VMD host bridge. If a child device has fewer, the device
will stop working if you offline a CPU: the child device may have a
resource affined to other online CPUs, but the VMD device affinity is to
that single offline CPU.
On Thu, 2019-11-07 at 03:10 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> > Using managed IRQ affinities sets up the VMD affinities identically to
> > the child devices when those devices' vector counts are limited by VMD.
> > This promotes better affinity handling as interrupts won't necessarily
> > need to pass context between non-local CPUs. One pre-vector is reserved
> > for the slow interrupt and not considered in the affinity algorithm.
>
> This only works if all devices have exactly the same number of interrupts
> as the parent VMD host bridge. If a child device has fewer, the device
> will stop working if you offline a CPU: the child device may have a
> resource affined to other online CPUs, but the VMD device affinity is to
> that single offline CPU.

Yes, that problem exists today, and this set limits the exposure, as it's
a rare case where you have a child NVMe device with fewer than 32 vectors.
On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> Yes that problem exists today

Not really, because we're currently using unmanaged interrupts which
migrate to online CPUs. It's only the managed ones that don't migrate
because they have an unchangeable affinity.

> and this set limits the exposure as it's
> a rare case where you have a child NVMe device with fewer than 32
> vectors.

I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
queues, yielding 31 vectors for IO services, so that already won't work
with this scheme.

But assuming you wanted to only use devices that have at least 32 IRQ
vectors, the nvme driver also allows users to carve those vectors up
into fully affinitized sets for different services (read vs. write is
the one the block stack supports), which would also break if alignment
to the parent device's IRQ setup is required.
On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > Yes that problem exists today
>
> Not really, because we're currently using unmanaged interrupts which
> migrate to online CPUs. It's only the managed ones that don't migrate
> because they have an unchangeable affinity.
>
> > and this set limits the exposure as it's
> > a rare case where you have a child NVMe device with fewer than 32
> > vectors.
>
> I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> queues, yielding 31 vectors for IO services, so that already won't work
> with this scheme.

Darn, you're right. At one point I had VMD vector 1 using the NVMe AQ,
leaving the 31 remaining VMD vectors for NVMe IO queues. This would
have lined up perfectly. Had changed it last minute, and that does break
the scheme for the P3700.

> But assuming you wanted to only use devices that have at least 32 IRQ
> vectors, the nvme driver also allows users to carve those vectors up
> into fully affinitized sets for different services (read vs. write is
> the one the block stack supports), which would also break if alignment
> to the parent device's IRQ setup is required.

Wasn't aware of this. Fair enough.
On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > Yes that problem exists today
> >
> > Not really, because we're currently using unmanaged interrupts which
> > migrate to online CPUs. It's only the managed ones that don't migrate
> > because they have an unchangeable affinity.
> >
> > > and this set limits the exposure as it's
> > > a rare case where you have a child NVMe device with fewer than 32
> > > vectors.
> >
> > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > queues, yielding 31 vectors for IO services, so that already won't work
> > with this scheme.
> >
> Darn, you're right. At one point I had VMD vector 1 using the NVMe AQ,
> leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> have lined up perfectly. Had changed it last minute, and that does break
> the scheme for the P3700.
>
> > But assuming you wanted to only use devices that have at least 32 IRQ
> > vectors, the nvme driver also allows users to carve those vectors up
> > into fully affinitized sets for different services (read vs. write is
> > the one the block stack supports), which would also break if alignment
> > to the parent device's IRQ setup is required.
>
> Wasn't aware of this. Fair enough.

Marked the series with "Changes Requested", waiting for a new
version.

Lorenzo
On Mon, 2019-11-18 at 10:49 +0000, Lorenzo Pieralisi wrote:
> On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> > On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > > Yes that problem exists today
> > >
> > > Not really, because we're currently using unmanaged interrupts which
> > > migrate to online CPUs. It's only the managed ones that don't migrate
> > > because they have an unchangeable affinity.
> > >
> > > > and this set limits the exposure as it's
> > > > a rare case where you have a child NVMe device with fewer than 32
> > > > vectors.
> > >
> > > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > > queues, yielding 31 vectors for IO services, so that already won't work
> > > with this scheme.
> > >
> > Darn, you're right. At one point I had VMD vector 1 using the NVMe AQ,
> > leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> > have lined up perfectly. Had changed it last minute, and that does break
> > the scheme for the P3700.
> >
> > > But assuming you wanted to only use devices that have at least 32 IRQ
> > > vectors, the nvme driver also allows users to carve those vectors up
> > > into fully affinitized sets for different services (read vs. write is
> > > the one the block stack supports), which would also break if alignment
> > > to the parent device's IRQ setup is required.
> >
> > Wasn't aware of this. Fair enough.
>
> Marked the series with "Changes Requested", waiting for a new
> version.
>
> Lorenzo

This will need an entirely different strategy to be useful, so it can be
dropped for now.

Thank you,
Jon
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 7aca925..be92076 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -157,22 +157,11 @@ static void vmd_irq_disable(struct irq_data *data)
 	raw_spin_unlock_irqrestore(&list_lock, flags);
 }
 
-/*
- * XXX: Stubbed until we develop acceptable way to not create conflicts with
- * other devices sharing the same vector.
- */
-static int vmd_irq_set_affinity(struct irq_data *data,
-				const struct cpumask *dest, bool force)
-{
-	return -EINVAL;
-}
-
 static struct irq_chip vmd_msi_controller = {
 	.name			= "VMD-MSI",
 	.irq_enable		= vmd_irq_enable,
 	.irq_disable		= vmd_irq_disable,
 	.irq_compose_msi_msg	= vmd_compose_msi_msg,
-	.irq_set_affinity	= vmd_irq_set_affinity,
 };
 
 static irq_hw_number_t vmd_get_hwirq(struct msi_domain_info *info,
@@ -722,6 +711,9 @@ static irqreturn_t vmd_irq(int irq, void *data)
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	struct vmd_dev *vmd;
+	struct irq_affinity affd = {
+		.pre_vectors = 1,
+	};
 	int i, err;
 
 	if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20))
@@ -749,8 +741,8 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (vmd->msix_count < 0)
 		return -ENODEV;
 
-	vmd->msix_count = pci_alloc_irq_vectors(dev, 1, vmd->msix_count,
-						PCI_IRQ_MSIX);
+	vmd->msix_count = pci_alloc_irq_vectors_affinity(dev, 1, vmd->msix_count,
+					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
 	if (vmd->msix_count < 0)
 		return vmd->msix_count;
Using managed IRQ affinities sets up the VMD affinities identically to
the child devices when those devices' vector counts are limited by VMD.
This promotes better affinity handling as interrupts won't necessarily
need to pass context between non-local CPUs. One pre-vector is reserved
for the slow interrupt and not considered in the affinity algorithm.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)