Message ID:   1573040408-3831-4-git-send-email-jonathan.derrick@intel.com
State:        Changes Requested
Delegated to: Lorenzo Pieralisi
Series:       PCI: vmd: Reducing tail latency by affining to the storage stack
On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> Using managed IRQ affinities sets up the VMD affinities identically to
> the child devices when those devices' vector counts are limited by VMD.
> This promotes better affinity handling as interrupts won't necessarily
> need to pass context between non-local CPUs. One pre-vector is reserved
> for the slow interrupt and not considered in the affinity algorithm.

This only works if all devices have exactly the same number of interrupts
as the parent VMD host bridge. If a child device has fewer, the device
will stop working if you offline a CPU: the child device may have a
resource affined to other online CPUs, but the VMD device affinity is to
that single offline CPU.
On Thu, 2019-11-07 at 03:10 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> > Using managed IRQ affinities sets up the VMD affinities identically to
> > the child devices when those devices' vector counts are limited by VMD.
> > This promotes better affinity handling as interrupts won't necessarily
> > need to pass context between non-local CPUs. One pre-vector is reserved
> > for the slow interrupt and not considered in the affinity algorithm.
>
> This only works if all devices have exactly the same number of interrupts
> as the parent VMD host bridge. If a child device has fewer, the device
> will stop working if you offline a CPU: the child device may have a
> resource affined to other online CPUs, but the VMD device affinity is to
> that single offline CPU.

Yes, that problem exists today, and this set limits the exposure, as it's
a rare case where you have a child NVMe device with fewer than 32 vectors.
On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> Yes that problem exists today

Not really, because we're currently using unmanaged interrupts which
migrate to online CPUs. It's only the managed ones that don't migrate
because they have an unchangeable affinity.

> and this set limits the exposure as it's
> a rare case where you have a child NVMe device with fewer than 32
> vectors.

I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
queues, yielding 31 vectors for IO services, so that already won't work
with this scheme.

But assuming you wanted to only use devices that have at least 32 IRQ
vectors, the nvme driver also allows users to carve those vectors up
into fully affinitized sets for different services (read vs. write is
the one the block stack supports), which would also break if alignment
to the parent device's IRQ setup is required.
On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > Yes that problem exists today
>
> Not really, because we're currently using unmanaged interrupts which
> migrate to online CPUs. It's only the managed ones that don't migrate
> because they have an unchangeable affinity.
>
> > and this set limits the exposure as it's
> > a rare case where you have a child NVMe device with fewer than 32
> > vectors.
>
> I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> queues, yielding 31 vectors for IO services, so that already won't work
> with this scheme.

Darn, you're right. At one point I had VMD vector 1 using the NVMe AQ,
leaving the 31 remaining VMD vectors for NVMe IO queues. This would
have lined up perfectly. Had changed it last minute, and that does break
the scheme for the P3700.

> But assuming you wanted to only use devices that have at least 32 IRQ
> vectors, the nvme driver also allows users to carve those vectors up
> into fully affinitized sets for different services (read vs. write is
> the one the block stack supports), which would also break if alignment
> to the parent device's IRQ setup is required.

Wasn't aware of this. Fair enough.
On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > Yes that problem exists today
> >
> > Not really, because we're currently using unmanaged interrupts which
> > migrate to online CPUs. It's only the managed ones that don't migrate
> > because they have an unchangeable affinity.
> >
> > > and this set limits the exposure as it's
> > > a rare case where you have a child NVMe device with fewer than 32
> > > vectors.
> >
> > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > queues, yielding 31 vectors for IO services, so that already won't work
> > with this scheme.
> >
> Darn, you're right. At one point I had VMD vector 1 using the NVMe AQ,
> leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> have lined up perfectly. Had changed it last minute, and that does break
> the scheme for the P3700.
>
> > But assuming you wanted to only use devices that have at least 32 IRQ
> > vectors, the nvme driver also allows users to carve those vectors up
> > into fully affinitized sets for different services (read vs. write is
> > the one the block stack supports), which would also break if alignment
> > to the parent device's IRQ setup is required.
>
> Wasn't aware of this. Fair enough.

Marked the series with "Changes Requested", waiting for a new
version.

Lorenzo
On Mon, 2019-11-18 at 10:49 +0000, Lorenzo Pieralisi wrote:
> On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> > On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > > Yes that problem exists today
> > >
> > > Not really, because we're currently using unmanaged interrupts which
> > > migrate to online CPUs. It's only the managed ones that don't migrate
> > > because they have an unchangeable affinity.
> > >
> > > > and this set limits the exposure as it's
> > > > a rare case where you have a child NVMe device with fewer than 32
> > > > vectors.
> > >
> > > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > > queues, yielding 31 vectors for IO services, so that already won't work
> > > with this scheme.
> > >
> > Darn, you're right. At one point I had VMD vector 1 using the NVMe AQ,
> > leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> > have lined up perfectly. Had changed it last minute, and that does break
> > the scheme for the P3700.
> >
> > > But assuming you wanted to only use devices that have at least 32 IRQ
> > > vectors, the nvme driver also allows users to carve those vectors up
> > > into fully affinitized sets for different services (read vs. write is
> > > the one the block stack supports), which would also break if alignment
> > > to the parent device's IRQ setup is required.
> >
> > Wasn't aware of this. Fair enough.
>
> Marked the series with "Changes Requested", waiting for a new
> version.
>
> Lorenzo

This will need an entirely different strategy to be useful, so it can be
dropped for now.

Thank you,
Jon
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 7aca925..be92076 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -157,22 +157,11 @@ static void vmd_irq_disable(struct irq_data *data)
 	raw_spin_unlock_irqrestore(&list_lock, flags);
 }
 
-/*
- * XXX: Stubbed until we develop acceptable way to not create conflicts with
- * other devices sharing the same vector.
- */
-static int vmd_irq_set_affinity(struct irq_data *data,
-				const struct cpumask *dest, bool force)
-{
-	return -EINVAL;
-}
-
 static struct irq_chip vmd_msi_controller = {
 	.name			= "VMD-MSI",
 	.irq_enable		= vmd_irq_enable,
 	.irq_disable		= vmd_irq_disable,
 	.irq_compose_msi_msg	= vmd_compose_msi_msg,
-	.irq_set_affinity	= vmd_irq_set_affinity,
 };
 
 static irq_hw_number_t vmd_get_hwirq(struct msi_domain_info *info,
@@ -722,6 +711,9 @@ static irqreturn_t vmd_irq(int irq, void *data)
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	struct vmd_dev *vmd;
+	struct irq_affinity affd = {
+		.pre_vectors = 1,
+	};
 	int i, err;
 
 	if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20))
@@ -749,8 +741,8 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (vmd->msix_count < 0)
 		return -ENODEV;
 
-	vmd->msix_count = pci_alloc_irq_vectors(dev, 1, vmd->msix_count,
-						PCI_IRQ_MSIX);
+	vmd->msix_count = pci_alloc_irq_vectors_affinity(dev, 1, vmd->msix_count,
+					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
 	if (vmd->msix_count < 0)
 		return vmd->msix_count;
Using managed IRQ affinities sets up the VMD affinities identically to
the child devices when those devices' vector counts are limited by VMD.
This promotes better affinity handling as interrupts won't necessarily
need to pass context between non-local CPUs. One pre-vector is reserved
for the slow interrupt and not considered in the affinity algorithm.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)