PCI/MSI: Avoid torn updates to MSI pairs

Message ID	20200116133102.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid
State	New
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> From: Evan Green <evgreen@chromium.org> To: Bjorn Helgaas <bhelgaas@google.com> Cc: Evan Green <evgreen@chromium.org>, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] PCI/MSI: Avoid torn updates to MSI pairs Date: Thu, 16 Jan 2020 13:31:28 -0800 Message-Id: <20200116133102.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-pci-owner@vger.kernel.org Precedence: bulk
Series	PCI/MSI: Avoid torn updates to MSI pairs \| expand PCI/MSI: Avoid torn updates to MSI pairs

Evan Green Jan. 16, 2020, 9:31 p.m. UTC

__pci_write_msi_msg() updates three registers in the device: address
high, address low, and data. On x86 systems, address low contains
CPU targeting info, and data contains the vector. The order of writes
is address, then data.

This is problematic if an interrupt comes in after address has
been written, but before data is updated, and the SMP affinity of
the interrupt is changing. In this case, the interrupt targets the
wrong vector on the new CPU.

This case is pretty easy to stumble into using xhci and CPU hotplugging.
Create a script that targets interrupts at a set of cores and then
offlines those cores. Put some stress on USB, and then watch xhci lose
an interrupt and die.

Avoid this by disabling MSIs during the update.

Signed-off-by: Evan Green <evgreen@chromium.org>
---


Bjorn,
I was unsure whether disabling MSIs temporarily is actually an okay
thing to do. I considered using the mask bit, but got the impression
that not all devices support the mask bit. Let me know if this going to
cause problems or there's a better way. I can include the repro
script I used to cause mayhem if needed.

---
 drivers/pci/msi.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Evan Green Jan. 18, 2020, 12:27 a.m. UTC | #1

On Thu, Jan 16, 2020 at 1:31 PM Evan Green <evgreen@chromium.org> wrote:
>
> __pci_write_msi_msg() updates three registers in the device: address
> high, address low, and data. On x86 systems, address low contains
> CPU targeting info, and data contains the vector. The order of writes
> is address, then data.
>
> This is problematic if an interrupt comes in after address has
> been written, but before data is updated, and the SMP affinity of
> the interrupt is changing. In this case, the interrupt targets the
> wrong vector on the new CPU.
>
> This case is pretty easy to stumble into using xhci and CPU hotplugging.
> Create a script that targets interrupts at a set of cores and then
> offlines those cores. Put some stress on USB, and then watch xhci lose
> an interrupt and die.
>
> Avoid this by disabling MSIs during the update.
>
> Signed-off-by: Evan Green <evgreen@chromium.org>

Note to reviewers: I posted a v2 of this patch with some improvements here:
https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/T/#u

Bjorn Helgaas Jan. 22, 2020, 5:28 p.m. UTC | #2

[+cc Thomas, Marc, Christoph, Rajat]

On Thu, Jan 16, 2020 at 01:31:28PM -0800, Evan Green wrote:
> __pci_write_msi_msg() updates three registers in the device: address
> high, address low, and data. On x86 systems, address low contains
> CPU targeting info, and data contains the vector. The order of writes
> is address, then data.
> 
> This is problematic if an interrupt comes in after address has
> been written, but before data is updated, and the SMP affinity of
> the interrupt is changing. In this case, the interrupt targets the
> wrong vector on the new CPU.
> 
> This case is pretty easy to stumble into using xhci and CPU hotplugging.
> Create a script that targets interrupts at a set of cores and then
> offlines those cores. Put some stress on USB, and then watch xhci lose
> an interrupt and die.
> 
> Avoid this by disabling MSIs during the update.
> 
> Signed-off-by: Evan Green <evgreen@chromium.org>
> ---
> 
> 
> Bjorn,
> I was unsure whether disabling MSIs temporarily is actually an okay
> thing to do. I considered using the mask bit, but got the impression
> that not all devices support the mask bit. Let me know if this going to
> cause problems or there's a better way. I can include the repro
> script I used to cause mayhem if needed.

I suspect this *is* a problem because I think disabling MSI doesn't
disable interrupts; it just means the device will interrupt using INTx
instead of MSI.  And the driver is probably not prepared to handle
INTx.

PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both
disabled, the Function requests servicing using INTx interrupts (if
supported)."

Maybe the IRQ guys have ideas about how to solve this?

> ---
>  drivers/pci/msi.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 6b43a5455c7af..97856ef862d68 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -328,7 +328,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  		u16 msgctl;
>  
>  		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> -		msgctl &= ~PCI_MSI_FLAGS_QSIZE;
> +		msgctl &= ~(PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
>  		msgctl |= entry->msi_attrib.multiple << 4;
>  		pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
>  
> @@ -343,6 +343,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  			pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
>  					      msg->data);
>  		}
> +
> +		msgctl |= PCI_MSI_FLAGS_ENABLE;
> +		pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
>  	}
>  
>  skip:
> -- 
> 2.24.1
>

Evan Green Jan. 22, 2020, 6:26 p.m. UTC | #3

On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Thomas, Marc, Christoph, Rajat]
>
> On Thu, Jan 16, 2020 at 01:31:28PM -0800, Evan Green wrote:
> > __pci_write_msi_msg() updates three registers in the device: address
> > high, address low, and data. On x86 systems, address low contains
> > CPU targeting info, and data contains the vector. The order of writes
> > is address, then data.
> >
> > This is problematic if an interrupt comes in after address has
> > been written, but before data is updated, and the SMP affinity of
> > the interrupt is changing. In this case, the interrupt targets the
> > wrong vector on the new CPU.
> >
> > This case is pretty easy to stumble into using xhci and CPU hotplugging.
> > Create a script that targets interrupts at a set of cores and then
> > offlines those cores. Put some stress on USB, and then watch xhci lose
> > an interrupt and die.
> >
> > Avoid this by disabling MSIs during the update.
> >
> > Signed-off-by: Evan Green <evgreen@chromium.org>
> > ---
> >
> >
> > Bjorn,
> > I was unsure whether disabling MSIs temporarily is actually an okay
> > thing to do. I considered using the mask bit, but got the impression
> > that not all devices support the mask bit. Let me know if this going to
> > cause problems or there's a better way. I can include the repro
> > script I used to cause mayhem if needed.
>
> I suspect this *is* a problem because I think disabling MSI doesn't
> disable interrupts; it just means the device will interrupt using INTx
> instead of MSI.  And the driver is probably not prepared to handle
> INTx.
>
> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both
> disabled, the Function requests servicing using INTx interrupts (if
> supported)."
>
> Maybe the IRQ guys have ideas about how to solve this?
>

But don't we already do this in __pci_restore_msi_state():
        pci_intx_for_msi(dev, 0);
        pci_msi_set_enable(dev, 0);
        arch_restore_msi_irqs(dev);

I'd think if there were a chance for a line-based interrupt to get in
and wedge itself, it would already be happening there.

One other way you could avoid torn MSI writes would be to ensure that
if you migrate IRQs across cores, you keep the same x86 vector number.
That way the address portion would be updated, and data doesn't
change, so there's no window. But that may not actually be feasible.

-Evan

Marc Zyngier Jan. 22, 2020, 6:52 p.m. UTC | #4

On 2020-01-22 17:28, Bjorn Helgaas wrote:
> [+cc Thomas, Marc, Christoph, Rajat]
> 
> On Thu, Jan 16, 2020 at 01:31:28PM -0800, Evan Green wrote:
>> __pci_write_msi_msg() updates three registers in the device: address
>> high, address low, and data. On x86 systems, address low contains
>> CPU targeting info, and data contains the vector. The order of writes
>> is address, then data.
>> 
>> This is problematic if an interrupt comes in after address has
>> been written, but before data is updated, and the SMP affinity of
>> the interrupt is changing. In this case, the interrupt targets the
>> wrong vector on the new CPU.
>> 
>> This case is pretty easy to stumble into using xhci and CPU 
>> hotplugging.
>> Create a script that targets interrupts at a set of cores and then
>> offlines those cores. Put some stress on USB, and then watch xhci lose
>> an interrupt and die.
>> 
>> Avoid this by disabling MSIs during the update.
>> 
>> Signed-off-by: Evan Green <evgreen@chromium.org>
>> ---
>> 
>> 
>> Bjorn,
>> I was unsure whether disabling MSIs temporarily is actually an okay
>> thing to do. I considered using the mask bit, but got the impression
>> that not all devices support the mask bit. Let me know if this going 
>> to
>> cause problems or there's a better way. I can include the repro
>> script I used to cause mayhem if needed.
> 
> I suspect this *is* a problem because I think disabling MSI doesn't
> disable interrupts; it just means the device will interrupt using INTx
> instead of MSI.  And the driver is probably not prepared to handle
> INTx.
> 
> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both
> disabled, the Function requests servicing using INTx interrupts (if
> supported)."
> 
> Maybe the IRQ guys have ideas about how to solve this?

Not from the top of my head. MSI-X should always support masking,
so we could at least handle that case properly and not loose interrupts.
Good ol' MSI is more tricky. Disabling MSI, as Bjorn pointed out, is
just going to make the problem worse.

There is also the problem that a number of drivers pick MSI instead of
MSI-X.

         M.

Thomas Gleixner Jan. 22, 2020, 11:37 p.m. UTC | #5

Evan Green <evgreen@chromium.org> writes:
> On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>> I suspect this *is* a problem because I think disabling MSI doesn't
>> disable interrupts; it just means the device will interrupt using INTx
>> instead of MSI.  And the driver is probably not prepared to handle
>> INTx.
>>
>> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both
>> disabled, the Function requests servicing using INTx interrupts (if
>> supported)."

Disabling MSI is not an option. Masking yes, but MSI does not have
mandatory masking. We already attempt masking on migration, which covers
only MSI-X reliably, but not all MSI incarnations.

So I assume that problem happens on a MSI interrupt, right?

>> Maybe the IRQ guys have ideas about how to solve this?

Maybe :)

> But don't we already do this in __pci_restore_msi_state():
>         pci_intx_for_msi(dev, 0);
>         pci_msi_set_enable(dev, 0);
>         arch_restore_msi_irqs(dev);
>
> I'd think if there were a chance for a line-based interrupt to get in
> and wedge itself, it would already be happening there.

That's a completely different beast. It's used when resetting a device
and for other stuff like virt state migration. That's not a model for
affinity changes of a live device.

> One other way you could avoid torn MSI writes would be to ensure that
> if you migrate IRQs across cores, you keep the same x86 vector number.
> That way the address portion would be updated, and data doesn't
> change, so there's no window. But that may not actually be feasible.

That's not possible simply because the x86 vector space is limited. If
we would have to guarantee that then we'd end up with a max of ~220
interrupts per system. Sufficient for your notebook, but the big iron
people would be not amused.

The real critical path here is the CPU hotplug path.

For regular migration between two online CPUs we use the 'migrate when
the irq is actually serviced ' mechanism. That might have the same issue
on misdesigned devices which are firing the next interrupt before the
one on the flight is serviced, but I haven't seen any reports with that
symptom yet.

But before I dig deeper into this, please provide the output of

'lscpci -vvv' and 'cat /proc/interrupts'

Thanks,

        tglx

Evan Green Jan. 23, 2020, 12:07 a.m. UTC | #6

On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Evan Green <evgreen@chromium.org> writes:
> > On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> I suspect this *is* a problem because I think disabling MSI doesn't
> >> disable interrupts; it just means the device will interrupt using INTx
> >> instead of MSI.  And the driver is probably not prepared to handle
> >> INTx.
> >>
> >> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both
> >> disabled, the Function requests servicing using INTx interrupts (if
> >> supported)."
>
> Disabling MSI is not an option. Masking yes, but MSI does not have
> mandatory masking. We already attempt masking on migration, which covers
> only MSI-X reliably, but not all MSI incarnations.
>
> So I assume that problem happens on a MSI interrupt, right?
>
> >> Maybe the IRQ guys have ideas about how to solve this?
>
> Maybe :)
>
> > But don't we already do this in __pci_restore_msi_state():
> >         pci_intx_for_msi(dev, 0);
> >         pci_msi_set_enable(dev, 0);
> >         arch_restore_msi_irqs(dev);
> >
> > I'd think if there were a chance for a line-based interrupt to get in
> > and wedge itself, it would already be happening there.
>
> That's a completely different beast. It's used when resetting a device
> and for other stuff like virt state migration. That's not a model for
> affinity changes of a live device.

Hm. Ok.

>
> > One other way you could avoid torn MSI writes would be to ensure that
> > if you migrate IRQs across cores, you keep the same x86 vector number.
> > That way the address portion would be updated, and data doesn't
> > change, so there's no window. But that may not actually be feasible.
>
> That's not possible simply because the x86 vector space is limited. If
> we would have to guarantee that then we'd end up with a max of ~220
> interrupts per system. Sufficient for your notebook, but the big iron
> people would be not amused.

Right, that occurred to me as well. The actual requirement isn't quite
as restrictive. What you really need is the old vector to be
registered on both the old CPU and the new CPU. Then once the
interrupt is confirmed to have moved we could release both the old
vector both CPUs, leaving only the new vector on the new CPU.

In that world some SMP affinity transitions might fail, which is a
bummer. To avoid that, you could first migrate to a vector that's
available on both the source and destination CPUs, keeping affinity
the same. Then change affinity in a separate step.

Or alternatively, you could permanently designate a "transit" vector.
If an interrupt fires on this vector, then we call all ISRs currently
in transit between CPUs. You might end up calling ISRs that didn't
actually need service, but at least that's better than missing edges.

>
> The real critical path here is the CPU hotplug path.
>
> For regular migration between two online CPUs we use the 'migrate when
> the irq is actually serviced ' mechanism. That might have the same issue
> on misdesigned devices which are firing the next interrupt before the
> one on the flight is serviced, but I haven't seen any reports with that
> symptom yet.
>
> But before I dig deeper into this, please provide the output of
>
> 'lscpci -vvv' and 'cat /proc/interrupts'
>

Here it is:
https://pastebin.com/YyxBUvQ2

This is a Comet Lake system. It has 8 HT cores, but 4 of those cores
have already been offlined.

At the bottom of the paste I also included the script I used that
causes a repro in a minute or two. I simply run this, then put some
stress on USB. For me that stress was "join a Hangouts meeting", since
that stressed both my USB webcam and USB ethernet. The script exits
when xhci dies.
-Evan

Thomas Gleixner Jan. 23, 2020, 8:42 a.m. UTC | #7

Evan Green <evgreen@chromium.org> writes:
> On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > One other way you could avoid torn MSI writes would be to ensure that
>> > if you migrate IRQs across cores, you keep the same x86 vector number.
>> > That way the address portion would be updated, and data doesn't
>> > change, so there's no window. But that may not actually be feasible.
>>
>> That's not possible simply because the x86 vector space is limited. If
>> we would have to guarantee that then we'd end up with a max of ~220
>> interrupts per system. Sufficient for your notebook, but the big iron
>> people would be not amused.
>
> Right, that occurred to me as well. The actual requirement isn't quite
> as restrictive. What you really need is the old vector to be
> registered on both the old CPU and the new CPU. Then once the
> interrupt is confirmed to have moved we could release both the old
> vector both CPUs, leaving only the new vector on the new CPU.

Sure, and how can you guarantee that without reserving the vector on all
CPUs in the first place? If you don't do that then if the vector is not
available affinity setting would fail every so often and it would pretty
much prevent hotplug if a to be migrated vector is not available on at
least one online CPU.

> In that world some SMP affinity transitions might fail, which is a
> bummer. To avoid that, you could first migrate to a vector that's
> available on both the source and destination CPUs, keeping affinity
> the same. Then change affinity in a separate step.

Good luck with doing that at the end of the hotplug routine where the
CPU is about to vanish.

> Or alternatively, you could permanently designate a "transit" vector.
> If an interrupt fires on this vector, then we call all ISRs currently
> in transit between CPUs. You might end up calling ISRs that didn't
> actually need service, but at least that's better than missing edges.

I don't think we need that. While walking the dogs I thought about
invoking a force migrated interrupt on the target CPU, but haven't
thought it through yet.

>> 'lscpci -vvv' and 'cat /proc/interrupts'
>
> Here it is:
> https://pastebin.com/YyxBUvQ2

Hrm:

        Capabilities: [80] MSI-X: Enable+ Count=16 Masked-

So this is weird. We mask it before moving it, so the tear issue should
not happen on MSI-X. So the tearing might be just a red herring.

Let me stare into the code a bit.

Thanks,

        tglx

Evan Green Jan. 23, 2020, 5:55 p.m. UTC | #8

On Thu, Jan 23, 2020 at 12:42 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Evan Green <evgreen@chromium.org> writes:
> > On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > One other way you could avoid torn MSI writes would be to ensure that
> >> > if you migrate IRQs across cores, you keep the same x86 vector number.
> >> > That way the address portion would be updated, and data doesn't
> >> > change, so there's no window. But that may not actually be feasible.
> >>
> >> That's not possible simply because the x86 vector space is limited. If
> >> we would have to guarantee that then we'd end up with a max of ~220
> >> interrupts per system. Sufficient for your notebook, but the big iron
> >> people would be not amused.
> >
> > Right, that occurred to me as well. The actual requirement isn't quite
> > as restrictive. What you really need is the old vector to be
> > registered on both the old CPU and the new CPU. Then once the
> > interrupt is confirmed to have moved we could release both the old
> > vector both CPUs, leaving only the new vector on the new CPU.
>
> Sure, and how can you guarantee that without reserving the vector on all
> CPUs in the first place? If you don't do that then if the vector is not
> available affinity setting would fail every so often and it would pretty
> much prevent hotplug if a to be migrated vector is not available on at
> least one online CPU.
>
> > In that world some SMP affinity transitions might fail, which is a
> > bummer. To avoid that, you could first migrate to a vector that's
> > available on both the source and destination CPUs, keeping affinity
> > the same. Then change affinity in a separate step.
>
> Good luck with doing that at the end of the hotplug routine where the
> CPU is about to vanish.
>
> > Or alternatively, you could permanently designate a "transit" vector.
> > If an interrupt fires on this vector, then we call all ISRs currently
> > in transit between CPUs. You might end up calling ISRs that didn't
> > actually need service, but at least that's better than missing edges.
>
> I don't think we need that. While walking the dogs I thought about
> invoking a force migrated interrupt on the target CPU, but haven't
> thought it through yet.

Yeah, I think the Intel folks did that in some tree of theirs too.

>
> >> 'lscpci -vvv' and 'cat /proc/interrupts'
> >
> > Here it is:
> > https://pastebin.com/YyxBUvQ2
>
> Hrm:
>
>         Capabilities: [80] MSI-X: Enable+ Count=16 Masked-
>
> So this is weird. We mask it before moving it, so the tear issue should
> not happen on MSI-X. So the tearing might be just a red herring.

Mmm... sorry what? This is the complete entry for xhci:

00:14.0 USB controller: Intel Corporation Device 02ed (prog-if 30 [XHCI])
        Subsystem: Intel Corporation Device 7270
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 124
        Region 0: Memory at d1200000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [70] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
                Address: 00000000fee10004  Data: 402a
        Capabilities: [90] Vendor Specific Information: Len=14 <?>
        Kernel driver in use: xhci_hcd


>
> Let me stare into the code a bit.

Thanks, I appreciate the help.

-Evan

Thomas Gleixner Jan. 23, 2020, 11:47 p.m. UTC | #9

Evan Green <evgreen@chromium.org> writes:
> On Thu, Jan 23, 2020 at 12:42 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Hrm:
>>
>>         Capabilities: [80] MSI-X: Enable+ Count=16 Masked-
>>
>> So this is weird. We mask it before moving it, so the tear issue should
>> not happen on MSI-X. So the tearing might be just a red herring.
>
> Mmm... sorry what? This is the complete entry for xhci:
>
> 00:14.0 USB controller: Intel Corporation Device 02ed (prog-if 30 [XHCI])
>         Subsystem: Intel Corporation Device 7270
>         Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium
>>TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 124
>         Region 0: Memory at d1200000 (64-bit, non-prefetchable) [size=64K]
>         Capabilities: [70] Power Management version 2
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
> PME(D0-,D1-,D2-,D3hot+,D3cold+)
>                 Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
>         Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
>                 Address: 00000000fee10004  Data: 402a
>         Capabilities: [90] Vendor Specific Information: Len=14 <?>
>         Kernel driver in use: xhci_hcd

Bah. I was looking at the WIFI for whatever reason.

Thanks,

        tglx

PCI/MSI: Avoid torn updates to MSI pairs

Commit Message

Comments

Patch