
[v9,00/14] pci: Work around ASMedia ASM2824 PCIe link training failures

Message ID alpine.DEB.2.21.2305310024400.59226@angie.orcam.me.uk (mailing list archive)
Series pci: Work around ASMedia ASM2824 PCIe link training failures

Message

Maciej W. Rozycki June 11, 2023, 5:19 p.m. UTC
Hi,

 This is v9 of the change to work around a PCIe link training phenomenon 
where a pair of devices both capable of operating at a link speed above 
2.5GT/s seems unable to negotiate the link speed and continues training 
indefinitely with the Link Training bit switching on and off repeatedly 
and the data link layer never reaching the active state.

 With several requests addressed and a few extra issues spotted this
version has now grown to 14 patches.  It has been verified for device 
enumeration with and without PCI_QUIRKS enabled, using the same piece of 
RISC-V hardware as previously.  Hot plug or reset events have not been 
verified, as this is difficult if at all feasible with hardware in 
question.

 Last iteration: 
<https://lore.kernel.org/r/alpine.DEB.2.21.2304060100160.13659@angie.orcam.me.uk/>, 
and my input to it:
<https://lore.kernel.org/r/alpine.DEB.2.21.2306080224280.36323@angie.orcam.me.uk/>.

  Maciej

Comments

Bjorn Helgaas June 14, 2023, 11:12 p.m. UTC | #1
On Sun, Jun 11, 2023 at 06:19:08PM +0100, Maciej W. Rozycki wrote:
> Hi,
> 
>  This is v9 of the change to work around a PCIe link training phenomenon 
> where a pair of devices both capable of operating at a link speed above 
> 2.5GT/s seems unable to negotiate the link speed and continues training 
> indefinitely with the Link Training bit switching on and off repeatedly 
> and the data link layer never reaching the active state.
> 
>  With several requests addressed and a few extra issues spotted this
> version has now grown to 14 patches.  It has been verified for device 
> enumeration with and without PCI_QUIRKS enabled, using the same piece of 
> RISC-V hardware as previously.  Hot plug or reset events have not been 
> verified, as this is difficult if at all feasible with hardware in 
> question.
> 
>  Last iteration: 
> <https://lore.kernel.org/r/alpine.DEB.2.21.2304060100160.13659@angie.orcam.me.uk/>, 
> and my input to it:
> <https://lore.kernel.org/r/alpine.DEB.2.21.2306080224280.36323@angie.orcam.me.uk/>.

Thanks, I applied these to pci/enumeration for v6.5.

I tweaked a few things, so double-check to be sure I didn't break
something:

  - Moved dev->link_active_reporting init to set_pcie_port_type()
    because it does other PCIe-related stuff.

  - Reordered to keep all the link_active_reporting things together.

  - Reordered to clean up & factor pcie_retrain_link() before exposing
    it to the rest of the PCI core.

  - Moved pcie_retrain_link() a little earlier to keep it next to
    pcie_wait_for_link_status().

  - Squashed the stubs into the actual quirk so we don't have the
    intermediate state where we call the stubs but they never do
    anything (let me know if there's a reason we need your order).

  - Inline pcie_parent_link_retrain(), which seemed like it didn't add
    enough to be worthwhile.

Interdiff below:

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 80694e2574b8..f11268924c8f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1153,27 +1153,16 @@ void pci_resume_bus(struct pci_bus *bus)
 		pci_walk_bus(bus, pci_resume_one, NULL);
 }
 
-/**
- * pcie_parent_link_retrain - Check and retrain link we are downstream from
- * @dev: PCI device to handle.
- *
- * Return TRUE if the link was retrained, FALSE otherwise.
- */
-static bool pcie_parent_link_retrain(struct pci_dev *dev)
-{
-	struct pci_dev *bridge;
-
-	bridge = pci_upstream_bridge(dev);
-	if (bridge)
-		return pcie_failed_link_retrain(bridge);
-	else
-		return false;
-}
-
 static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 {
-	bool retrain = true;
 	int delay = 1;
+	bool retrain = false;
+	struct pci_dev *bridge;
+
+	if (pci_is_pcie(dev)) {
+		retrain = true;
+		bridge = pci_upstream_bridge(dev);
+	}
 
 	/*
 	 * After reset, the device should not silently discard config
@@ -1201,9 +1190,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 		}
 
 		if (delay > PCI_RESET_WAIT) {
-			if (retrain) {
+			if (retrain && bridge) {
 				retrain = false;
-				if (pcie_parent_link_retrain(dev)) {
+				if (pcie_failed_link_retrain(bridge)) {
 					delay = 1;
 					continue;
 				}
@@ -4914,6 +4903,38 @@ static bool pcie_wait_for_link_status(struct pci_dev *pdev,
 	return (lnksta & lnksta_mask) == lnksta_match;
 }
 
+/**
+ * pcie_retrain_link - Request a link retrain and wait for it to complete
+ * @pdev: Device whose link to retrain.
+ * @use_lt: Use the LT bit if TRUE, or the DLLLA bit if FALSE, for status.
+ *
+ * Retrain completion status is retrieved from the Link Status Register
+ * according to @use_lt.  It is not verified whether the use of the DLLLA
+ * bit is valid.
+ *
+ * Return TRUE if successful, or FALSE if training has not completed
+ * within PCIE_LINK_RETRAIN_TIMEOUT_MS milliseconds.
+ */
+bool pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
+{
+	u16 lnkctl;
+
+	pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
+	lnkctl |= PCI_EXP_LNKCTL_RL;
+	pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
+	if (pdev->clear_retrain_link) {
+		/*
+		 * Due to an erratum in some devices the Retrain Link bit
+		 * needs to be cleared again manually to allow the link
+		 * training to succeed.
+		 */
+		lnkctl &= ~PCI_EXP_LNKCTL_RL;
+		pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
+	}
+
+	return pcie_wait_for_link_status(pdev, use_lt, !use_lt);
+}
+
 /**
  * pcie_wait_for_link_delay - Wait until link is active or inactive
  * @pdev: Bridge device
@@ -4968,37 +4989,6 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 	return pcie_wait_for_link_delay(pdev, active, 100);
 }
 
-/**
- * pcie_retrain_link - Request a link retrain and wait for it to complete
- * @pdev: Device whose link to retrain.
- * @use_lt: Use the LT bit if TRUE, or the DLLLA bit if FALSE, for status.
- *
- * Retrain completion status is retrieved from the Link Status Register
- * according to @use_lt.  It is not verified whether the use of the DLLLA
- * bit is valid.
- *
- * Return TRUE if successful, or FALSE if training has not completed.
- */
-bool pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
-{
-	u16 lnkctl;
-
-	pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
-	lnkctl |= PCI_EXP_LNKCTL_RL;
-	pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
-	if (pdev->clear_retrain_link) {
-		/*
-		 * Due to an erratum in some devices the Retrain Link bit
-		 * needs to be cleared again manually to allow the link
-		 * training to succeed.
-		 */
-		lnkctl &= ~PCI_EXP_LNKCTL_RL;
-		pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
-	}
-
-	return pcie_wait_for_link_status(pdev, use_lt, !use_lt);
-}
-
 /*
  * Find maximum D3cold delay required by all the devices on the bus.  The
  * spec says 100 ms, but firmware can lower it and we allow drivers to
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 016a9d4a61f7..f547db0a728f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1526,6 +1526,7 @@ void set_pcie_port_type(struct pci_dev *pdev)
 {
 	int pos;
 	u16 reg16;
+	u32 reg32;
 	int type;
 	struct pci_dev *parent;
 
@@ -1539,6 +1540,10 @@ void set_pcie_port_type(struct pci_dev *pdev)
 	pci_read_config_dword(pdev, pos + PCI_EXP_DEVCAP, &pdev->devcap);
 	pdev->pcie_mpss = FIELD_GET(PCI_EXP_DEVCAP_PAYLOAD, pdev->devcap);
 
+	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &reg32);
+	if (reg32 & PCI_EXP_LNKCAP_DLLLARC)
+		pdev->link_active_reporting = 1;
+
 	parent = pci_upstream_bridge(pdev);
 	if (!parent)
 		return;
@@ -1828,7 +1833,6 @@ int pci_setup_device(struct pci_dev *dev)
 	int err, pos = 0;
 	struct pci_bus_region region;
 	struct resource *res;
-	u32 linkcap;
 
 	hdr_type = pci_hdr_type(dev);
 
@@ -1876,10 +1880,6 @@ int pci_setup_device(struct pci_dev *dev)
 	/* "Unknown power state" */
 	dev->current_state = PCI_UNKNOWN;
 
-	/* Set it early to make it available to fixups, etc.  */
-	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
-	dev->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
-
 	/* Early fixups, before probing the BARs */
 	pci_fixup_device(pci_fixup_early, dev);
Maciej W. Rozycki June 15, 2023, 12:41 a.m. UTC | #2
On Wed, 14 Jun 2023, Bjorn Helgaas wrote:

> >  This is v9 of the change to work around a PCIe link training phenomenon 
> > where a pair of devices both capable of operating at a link speed above 
> > 2.5GT/s seems unable to negotiate the link speed and continues training 
> > indefinitely with the Link Training bit switching on and off repeatedly 
> > and the data link layer never reaching the active state.
> > 
> >  With several requests addressed and a few extra issues spotted this
> > version has now grown to 14 patches.  It has been verified for device 
> > enumeration with and without PCI_QUIRKS enabled, using the same piece of 
> > RISC-V hardware as previously.  Hot plug or reset events have not been 
> > verified, as this is difficult if at all feasible with hardware in 
> > question.
> > 
> >  Last iteration: 
> > <https://lore.kernel.org/r/alpine.DEB.2.21.2304060100160.13659@angie.orcam.me.uk/>, 
> > and my input to it:
> > <https://lore.kernel.org/r/alpine.DEB.2.21.2306080224280.36323@angie.orcam.me.uk/>.
> 
> Thanks, I applied these to pci/enumeration for v6.5.

 Great, thanks!

> I tweaked a few things, so double-check to be sure I didn't break
> something:
> 
>   - Moved dev->link_active_reporting init to set_pcie_port_type()
>     because it does other PCIe-related stuff.
> 
>   - Reordered to keep all the link_active_reporting things together.
> 
>   - Reordered to clean up & factor pcie_retrain_link() before exposing
>     it to the rest of the PCI core.
> 
>   - Moved pcie_retrain_link() a little earlier to keep it next to
>     pcie_wait_for_link_status().
> 
>   - Squashed the stubs into the actual quirk so we don't have the
>     intermediate state where we call the stubs but they never do
>     anything (let me know if there's a reason we need your order).
> 
>   - Inline pcie_parent_link_retrain(), which seemed like it didn't add
>     enough to be worthwhile.

 Ack, I'll double-check and report back.  A minor nit I've spotted below:

>  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  {
> -	bool retrain = true;
>  	int delay = 1;
> +	bool retrain = false;
> +	struct pci_dev *bridge;
> +
> +	if (pci_is_pcie(dev)) {
> +		retrain = true;
> +		bridge = pci_upstream_bridge(dev);
> +	}

 If doing it this way, which I actually like, I think it would be a little 
bit better performance- and style-wise if this was written as:

	if (pci_is_pcie(dev)) {
		bridge = pci_upstream_bridge(dev);
		retrain = !!bridge;
	}

(or "retrain = bridge != NULL" if you prefer this style), and then we 
don't have to repeatedly check two variables iff (pcie && !bridge) in the 
loop below:

> @@ -1201,9 +1190,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  		}
>  
>  		if (delay > PCI_RESET_WAIT) {
> -			if (retrain) {
> +			if (retrain && bridge) {

-- i.e. code can stay then as:

			if (retrain) {

here.  I hope you find this observation rather obvious, so will you amend 
your tree, or shall I send an incremental update?

 Otherwise I don't find anything suspicious with the interdiff itself 
(thanks for posting it, that's really useful indeed!), but as I say I'll 
yet double-check how things look and work with your tree.  Hopefully 
tomorrow (Thu), as I have other stuff yet to complete tonight.

  Maciej
Bjorn Helgaas June 15, 2023, 6:37 p.m. UTC | #3
On Thu, Jun 15, 2023 at 01:41:10AM +0100, Maciej W. Rozycki wrote:
> On Wed, 14 Jun 2023, Bjorn Helgaas wrote:
> 
> > >  This is v9 of the change to work around a PCIe link training phenomenon 
> > > where a pair of devices both capable of operating at a link speed above 
> > > 2.5GT/s seems unable to negotiate the link speed and continues training 
> > > indefinitely with the Link Training bit switching on and off repeatedly 
> > > and the data link layer never reaching the active state.
> > > 
> > >  With several requests addressed and a few extra issues spotted this
> > > version has now grown to 14 patches.  It has been verified for device 
> > > enumeration with and without PCI_QUIRKS enabled, using the same piece of 
> > > RISC-V hardware as previously.  Hot plug or reset events have not been 
> > > verified, as this is difficult if at all feasible with hardware in 
> > > question.

> >  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> >  {
> > -	bool retrain = true;
> >  	int delay = 1;
> > +	bool retrain = false;
> > +	struct pci_dev *bridge;
> > +
> > +	if (pci_is_pcie(dev)) {
> > +		retrain = true;
> > +		bridge = pci_upstream_bridge(dev);
> > +	}
> 
>  If doing it this way, which I actually like, I think it would be a little 
> bit better performance- and style-wise if this was written as:
> 
> 	if (pci_is_pcie(dev)) {
> 		bridge = pci_upstream_bridge(dev);
> 		retrain = !!bridge;
> 	}
> 
> (or "retrain = bridge != NULL" if you prefer this style), and then we 
> don't have to repeatedly check two variables iff (pcie && !bridge) in the 
> loop below:

Done, thanks, I do like that better.  I did:

  bridge = pci_upstream_bridge(dev);
  if (bridge)
    retrain = true;

because it seems like it flows more naturally when reading.

Bjorn
Maciej W. Rozycki June 16, 2023, 12:27 p.m. UTC | #4
On Thu, 15 Jun 2023, Bjorn Helgaas wrote:

> >  If doing it this way, which I actually like, I think it would be a little 
> > bit better performance- and style-wise if this was written as:
> > 
> > 	if (pci_is_pcie(dev)) {
> > 		bridge = pci_upstream_bridge(dev);
> > 		retrain = !!bridge;
> > 	}
> > 
> > (or "retrain = bridge != NULL" if you prefer this style), and then we 
> > don't have to repeatedly check two variables iff (pcie && !bridge) in the 
> > loop below:
> 
> Done, thanks, I do like that better.  I did:
> 
>   bridge = pci_upstream_bridge(dev);
>   if (bridge)
>     retrain = true;
> 
> because it seems like it flows more naturally when reading.

 Perfect, and good timing too, as I have just started checking your tree 
as your message arrived.  I ran my usual tests with and w/o PCI_QUIRKS 
enabled and results were as expected.  As before I didn't check hot plug 
and reset paths as these features are awkward with the HiFive Unmatched 
system involved.

 I have skimmed over the changes as committed to pci/enumeration and found 
nothing suspicious.  I have verified that the tree builds as at each of 
them with my configuration.

 As per my earlier remark:

> I think making a system halfway-fixed would make little sense, but with
> the actual fix actually made last as you suggested I think this can be
> split off, because it'll make no functional change by itself.

I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS 
stub into the change carrying the actual workaround and then have the 
reset path update with a follow-up change only, but I won't fight over it.  
It's only one tree revision that will be in this halfway-fixed state and 
I'll trust your judgement here.

 Let me know if anything pops up related to these changes anytime and I'll 
be happy to look into it.  The system involved is nearing two years since 
its deployment already, but hopefully it has many years to go yet and will 
continue being ready to verify things.  It's not that there's lots of real 
RISC-V hardware available, let alone with PCI/e connectivity.

 Thank you for staying with me and reviewing this patch series through all 
the iterations.

  Maciej
Bjorn Helgaas June 16, 2023, 8:29 p.m. UTC | #5
On Fri, Jun 16, 2023 at 01:27:52PM +0100, Maciej W. Rozycki wrote:
> On Thu, 15 Jun 2023, Bjorn Helgaas wrote:

>  As per my earlier remark:
> 
> > I think making a system halfway-fixed would make little sense, but with
> > the actual fix actually made last as you suggested I think this can be
> > split off, because it'll make no functional change by itself.
> 
> I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS 
> stub into the change carrying the actual workaround and then have the 
> reset path update with a follow-up change only, but I won't fight over it.  
> It's only one tree revision that will be in this halfway-fixed state and 
> I'll trust your judgement here.

Thanks for raising this.  Here's my thought process:

  12 PCI: Provide stub failed link recovery for device probing and hot plug
  13 PCI: Add failed link recovery for device reset events
  14 PCI: Work around PCIe link training failures

Patch 12 [1] adds calls to pcie_failed_link_retrain(), which does
nothing and returns false.  Functionally, it's a no-op, but the
structure is important later.

Patch 13 [2] claims to request failed link recovery after resets, but
actually doesn't do anything yet because pcie_failed_link_retrain() is
still a no-op, so this was a bit confusing.

Patch 14 [3] implements pcie_failed_link_retrain(), so the recovery
mentioned in 12 and 13 actually happens.  But this patch doesn't add
the call to pcie_failed_link_retrain(), so it's a little bit hard to
connect the dots.

I agree that as I rearranged it, the workaround doesn't apply in all
cases simultaneously.  Maybe not ideal, but maybe not terrible either.
Looking at it again, maybe it would have made more sense to move the
pcie_wait_for_link_delay() change to the last patch along with the
pci_dev_wait() change.  I dunno.

Bjorn

[1] 12 https://lore.kernel.org/r/alpine.DEB.2.21.2306111619570.64925@angie.orcam.me.uk
[2] 13 https://lore.kernel.org/r/alpine.DEB.2.21.2306111631050.64925@angie.orcam.me.uk
[3] 14 https://lore.kernel.org/r/alpine.DEB.2.21.2305310038540.59226@angie.orcam.me.uk
Maciej W. Rozycki June 20, 2023, 9:54 a.m. UTC | #6
On Fri, 16 Jun 2023, Bjorn Helgaas wrote:

> I agree that as I rearranged it, the workaround doesn't apply in all
> cases simultaneously.  Maybe not ideal, but maybe not terrible either.
> Looking at it again, maybe it would have made more sense to move the
> pcie_wait_for_link_delay() change to the last patch along with the
> pci_dev_wait() change.  I dunno.

 I think the order of the changes is not important enough to justify 
spending a lot of time and mental effort on it.  You decided, so be it.  
Thank you for your effort made with this review.

 With this series out of the way I have now posted a small clean-up for 
SBR code duplication between PCI core and an InfiniBand driver I came 
across in the course of working on this series.  See 
<https://lore.kernel.org/r/alpine.DEB.2.21.2306200153110.14084@angie.orcam.me.uk/>.

 Please have a look at your convenience.

  Maciej
Matthew W Carlis Aug. 6, 2024, 12:06 a.m. UTC | #7
Hello again. I just realized that my first response to this thread two weeks
ago was not actually starting from the end of the discussion. I hope I found
it now... Must say sorry for this; I am still figuring out how to follow these
threads.
I need to ask if we can either revert this patch or modify the quirk to run
only on the device mentioned (ASMedia ASM2824). We have now identified it as
causing devices to get stuck at Gen1 in multiple generations of our hardware
& across product lines on ports where hot-plug is common. To be a little more
specific, it includes Intel root ports, Broadcom PCIe switch ports and also
Microchip PCIe switch ports.
The most common place where we see our systems getting stuck at Gen1 is with
device power cycling. If a device is powered on and then off quickly then the
link will of course fail to train & the consequence here is that the port is
forced to Gen1 forever. Does anybody know why the patch will only remove the
forced Gen1 speed from the ASMedia device?

- Matt
Bjorn Helgaas Aug. 6, 2024, 7:36 p.m. UTC | #8
On Mon, Aug 05, 2024 at 06:06:59PM -0600, Matthew W Carlis wrote:
> Hello again. I just realized that my first response to this thread two weeks
> ago was not actually starting from the end of the discussion. I hope I found
> it now... Must say sorry for this; I am still figuring out how to follow these
> threads.
> I need to ask if we can either revert this patch or modify the quirk to run
> only on the device mentioned (ASMedia ASM2824). We have now identified it as
> causing devices to get stuck at Gen1 in multiple generations of our hardware
> & across product lines on ports where hot-plug is common. To be a little more
> specific, it includes Intel root ports, Broadcom PCIe switch ports and also
> Microchip PCIe switch ports.
> The most common place where we see our systems getting stuck at Gen1 is with
> device power cycling. If a device is powered on and then off quickly then the
> link will of course fail to train & the consequence here is that the port is
> forced to Gen1 forever. Does anybody know why the patch will only remove the
> forced Gen1 speed from the ASMedia device?

Thanks for keeping this thread alive.  I don't know the fix, but it
does seem like this series made ASMedia ASM2824 work better but
caused regressions elsewhere, so maybe we just need to accept that
ASM2824 is slightly broken and doesn't work as well as it should.

Bjorn
Matthew W Carlis Aug. 7, 2024, 8:43 a.m. UTC | #9
On Tue, 06 Aug 2024 Bjorn Helgaas wrote:
> it does seem like this series made ASMedia ASM2824 work better but
> caused regressions elsewhere, so maybe we just need to accept that
> ASM2824 is slightly broken and doesn't work as well as it should.

One of my colleagues challenged me to provide a more concrete example
where the change will cause problems. One such configuration would be not
implementing the Power Controller Control in the Slot Capabilities Register.
Then, Powering off the slot via out-of-band interfaces would result in the
kernel forcing the DSP to Gen1 100% of the time as far as I can tell. 
The aspect of this force to Gen1 that is the most concerning to my team is
that it isn't cleaned up even if we replaced the EP with some other EP.

I was curious about the PCIe devices mentioned in the commit because I
look at crazy malfunctioning devices too often so I pasted the following:
"Delock Riser Card PCI Expres 41433" into Google. 
I'm not really a physical layer guy, but is it possible that the reported
issue be due to signal integrity? I'm not sure if sending PCIe over a USB
cable is "reliable".

I've never worked with an ASMedia switch and don't have a reliable way to
reproduce anything like the interaction between the two devices at hand. As
much as I hate to make the request, my thinking is that the patch should be
reverted until there is a solution that doesn't leave the link forced to
Gen1 forever for every EP thereafter.

- Matt
Maciej W. Rozycki Aug. 7, 2024, 11:14 a.m. UTC | #10
On Wed, 7 Aug 2024, Matthew W Carlis wrote:

> > it does seem like this series made ASMedia ASM2824 work better but
> > caused regressions elsewhere, so maybe we just need to accept that
> > ASM2824 is slightly broken and doesn't work as well as it should.
> 
> One of my colleagues challenged me to provide a more concrete example
> where the change will cause problems. One such configuration would be not
> implementing the Power Controller Control in the Slot Capabilities Register.
> Then, Powering off the slot via out-of-band interfaces would result in the
> kernel forcing the DSP to Gen1 100% of the time as far as I can tell. 
> The aspect of this force to Gen1 that is the most concerning to my team is
> that it isn't cleaned up even if we replaced the EP with some other EP.

 Why does that happen?

 For the quirk to trigger, the link has to be down and there has to be the 
LBMS Link Status bit set from link management events as per the PCIe spec 
while the link was previously up, and then both of that while rescanning 
the PCIe device in question, so there's a lot of conditions to meet.  Is 
it the case that in your setup there is no device at this point, but one 
gets plugged in later?
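
 For illustration, the gist of that check is roughly the following (a 
sketch only, using the usual kernel helpers; the function name is made 
up and this is not the exact code in the tree):

	/*
	 * Sketch: the workaround only considers acting when the link is
	 * down (DLLLA clear) while link bandwidth management events have
	 * been recorded (LBMS set) in the downstream port.
	 */
	static bool link_down_with_lbms(struct pci_dev *bridge)
	{
		u16 lnksta;

		pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &lnksta);
		return !(lnksta & PCI_EXP_LNKSTA_DLLLA) &&
		       (lnksta & PCI_EXP_LNKSTA_LBMS);
	}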

 One aspect to mention here is that when taking a device offline the LBMS 
Link Status bit really ought to be cleared by Linux in the corresponding 
downstream port of the parent device.  As I recall Ilpo was working on a 
broader link bandwidth management subsystem for Linux, which would do that 
among others.  He asked me to make experiments with my problematic machine 
to see if this would interfere, but unrelated issues with DRAM controller 
(now fixed by reducing the DDR clock rate) have prevented me from doing 
that.  I'll see if I can get back to that soon.
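
 Since LBMS is a write-1-to-clear status bit, clearing it when a device 
goes away would amount to something like this (sketch only):

	/* LBMS is RW1C: writing the bit back clears it. */
	pcie_capability_write_word(bridge, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS);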

> I was curious about the PCIe devices mentioned in the commit because I
> look at crazy malfunctioning devices too often so I pasted the following:
> "Delock Riser Card PCI Expres 41433" into Google. 
> I'm not really a physical layer guy, but is it possible that the reported
> issue be due to signal integrity? I'm not sure if sending PCIe over a USB
> cable is "reliable".

 Well, it's a transmission line and as long as bandwidth and latency 
requirements are met it's as good as any.  My understanding has been that one 
of the objectives of PCIe over conventional PCI was to make external links, 
such as to expansion boxes, easier to implement.

 Please note that it is a purpose-made cable too, rather than just an 
off-the-shelf USB cable.  Then I have solutions deployed with PCIe routed 
over DVI cables too.

 I doubt it could be a signal integrity issue, because once the link has 
negotiated the highest speed possible (Gen2) via this complicated dance it 
works with no issues for months.  I'm not sure what the highest uptime for 
the system in question exactly was, but it was in the range of half a 
year, and I have a network interface downstream which I regularly use for 
heavy NFS traffic in GNU tool chain regression verification, so any issue 
would have shown up pretty quickly.

 Given how switching between speed rates works with PCIe (by establishing 
a link at 2.5GT/s first and then exchanging rates available as data before 
choosing the highest supported by both endpoints) I suspect that it is a 
protocol issue: either or both devices have got it slightly wrong, which 
breaks it when they're combined together.  Otherwise why would retraining 
to 5GT/s by hand work while it doesn't when done by the hardware itself?  
There is not much, if any, difference here between the two scenarios really.
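
 For reference, "retraining to 5GT/s by hand" amounts in kernel terms to 
roughly the following (a sketch only, not a complete sequence; waiting 
for the retrain to settle is omitted):

	/*
	 * Sketch: raise the Target Link Speed on the downstream port and
	 * then request a link retrain.
	 */
	pcie_capability_clear_and_set_word(bridge, PCI_EXP_LNKCTL2,
					   PCI_EXP_LNKCTL2_TLS,
					   PCI_EXP_LNKCTL2_TLS_5_0GT);
	pcie_capability_set_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_RL);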

> I've never worked with an ASMedia switch and don't have a reliable way to
> reproduce anything like the interaction between the two devices at hand. As
> much as I hate to make the request, my thinking is that the patch should be
> reverted until there is a solution that doesn't leave the link forced to
> Gen1 forever for every EP thereafter.

 I'm working on such a change this week.

 It's just that my primary objectives for this maintenance visit at my lab 
were to fix a pair of broken PSUs (which I have now done) and upgrade the 
firmware of a console manager device to fix an issue with a remote serial 
BREAK feature affecting Magic SysRq (also completed), plus I have a day 
job too that is unrelated.

 I also bought a PCIe to dual M.2 M-Key option card with the ASM2824 
switch onboard to have a different setup to evaluate and determine if this 
issue is specific to the RISC-V board or not.  Unfortunately the ASM2824 
does not bring the downstream ports up if the card is placed in a slot 
that does not supply Vaux (which is a conforming arrangement according to 
the PCIe spec), apparently due to a quirk in the ASM2824 switch according 
to the option card manufacturer, and there is no software workaround 
possible.

 So I won't be able to use this alternative arrangement until I have 
modified the option card, which I won't be able to do before October at the 
earliest.

 I just can't help that there are so many broken hardware designs out there 
and I have to strive to navigate through them and do my job regardless.

  Maciej
Maciej W. Rozycki Aug. 7, 2024, 11:49 a.m. UTC | #11
On Mon, 5 Aug 2024, Matthew W Carlis wrote:

> The most common place where we see our systems getting stuck at Gen1 is with
> device power cycling. If a device is powered on and then off quickly then the
> link will of course fail to train & the consequence here is that the port is
> forced to Gen1 forever. Does anybody know why the patch will only remove the
> forced Gen1 speed from the ASMedia device?

 I know, as the author of the change.

 The reason for this is safety: it's better to have a link run at 2.5GT/s 
than not at all, and from the nature of the issue there is no guarantee 
that if you remove the speed clamp, then the link won't go back into the 
infinite retraining loop that the workaround is supposed to break.
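
 To be clear about what the clamp is: it is just the Target Link Speed 
field of the Link Control 2 register, so in outline (a sketch only, not 
the exact quirk code):

	/* Sketch: restrict the port to 2.5GT/s before retraining. */
	pcie_capability_clear_and_set_word(bridge, PCI_EXP_LNKCTL2,
					   PCI_EXP_LNKCTL2_TLS,
					   PCI_EXP_LNKCTL2_TLS_2_5GT);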

 I was only able to verify that the speed clamp is safe to lift with this 
particular device, but if you hit the actual issue with any other device 
and find that it is safe to remove the clamp as well, then you're welcome 
to add it to the list too.

  Maciej
Oliver O'Halloran Aug. 7, 2024, 12:29 p.m. UTC | #12
On Wed, Aug 7, 2024 at 9:14 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> On Wed, 7 Aug 2024, Matthew W Carlis wrote:
>
> > > it does seem like this series made ASMedia ASM2824 work better but
> > > caused regressions elsewhere, so maybe we just need to accept that
> > > ASM2824 is slightly broken and doesn't work as well as it should.
> >
> > One of my colleagues challenged me to provide a more concrete example
> > where the change will cause problems. One such configuration would be not
> > implementing the Power Controller Control in the Slot Capabilities Register.
> > Then, Powering off the slot via out-of-band interfaces would result in the
> > kernel forcing the DSP to Gen1 100% of the time as far as I can tell.
> > The aspect of this force to Gen1 that is the most concerning to my team is
> > that it isn't cleaned up even if we replaced the EP with some other EP.
>
>  Why does that happen?
>
>  For the quirk to trigger, the link has to be down and there has to be the
> LBMS Link Status bit set from link management events as per the PCIe spec
> while the link was previously up, and then both of that while rescanning
> the PCIe device in question, so there's a lot of conditions to meet.  Is
> it the case that in your setup there is no device at this point, but one
> gets plugged in later?

My read was that Matt is essentially doing a surprise hot-unplug by
removing power to the card without notifying the OS. I thought the
LBMS bit wouldn't be set in that case since the link goes down rather
than changes speed, but the spec is a little vague and that appears to
be happening in Matt's testing. It might be worth disabling the
workaround if the port has the surprise hotplug capability bit set.
It's fairly common for ports on NVMe drive backplanes to have it set
and a lot of people would be unhappy about those being forced to Gen 1
by accident.
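
For illustration, the kind of gating I have in mind is roughly this (a
sketch only; the helper name is made up and the exact condition is up
for debate):

	/*
	 * Sketch: skip the Gen1 clamp on ports that advertise surprise
	 * hot-plug, where the link going down is expected behaviour.
	 */
	static bool slot_expects_surprise_removal(struct pci_dev *bridge)
	{
		u32 sltcap;

		pcie_capability_read_dword(bridge, PCI_EXP_SLTCAP, &sltcap);
		return sltcap & PCI_EXP_SLTCAP_HPS;
	}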
Matthew W Carlis Aug. 8, 2024, 2:07 a.m. UTC | #13
On Wed, 7 Aug 2024 22:29:35 +1000 Oliver O'Halloran Wrote
> My read was that Matt is essentially doing a surprise hot-unplug by
> removing power to the card without notifying the OS. I thought the
> LBMS bit wouldn't be set in that case since the link goes down rather
> than changes speed, but the spec is a little vague and that appears to
> be happening in Matt's testing. It might be worth disabling the
> workaround if the port has the surprise hotplug capability bit set.

Most of the systems I have are using downstream port containment which does
not recommend setting the Hot-Plug Surprise in Slot Capabilities & therefore
we do not. The first time we noticed an issue with this patch was in test
automation which was power cycling the endpoints & injecting uncorrectable
errors to ensure our hosts are robust in the face of PCIe chaos & that they
will recover. Later we started to see other teams on other products
encountering the same bug in simpler cases where humans turn on and off
EP power for development purposes. Although we are not using Hot-Plug Surprise
we are often triggering DPC on the Surprise Down Uncorrectable Error when
we hit this Gen1 issue.

On Wed, 7 Aug 2024 12:14:13 +0100 Maciej W. Rozycki Wrote
> For the quirk to trigger, the link has to be down and there has to be the
> LBMS Link Status bit set from link management events as per the PCIe spec
> while the link was previously up, and then both of that while rescanning
> the PCIe device in question, so there's a lot of conditions to meet.

If there is nothing clearing the bit then why is there any expectation that
it wouldn't be set? We have 20/30/40 endpoints in a host that can be hot-removed,
hot-added at any point in time in any combination & it's often the case that
the system uptime is hundreds of days. Eventually the bit will just become set
as a result of time and scale.

On Wed, 7 Aug 2024 12:14:13 +0100 Maciej W. Rozycki Wrote
> The reason for this is safety: it's better to have a link run at 2.5GT/s
> than not at all, and from the nature of the issue there is no guarantee
> that if you remove the speed clamp, then the link won't go back into the
> infinite retraining loop that the workaround is supposed to break.

I guess I don't really understand why it wouldn't be safe for every device
other than the ASMedia since they aren't the device that had the issue in the
first place. The main problem in my mind is the system doesn't actually have to
be retraining at all, it can occur the first time you replace a device or
power cycle the device or the first time the device goes into DPC & comes back.
If the quirk simply returned without doing anything when the ASMedia is not in the
allow-list it would make me more at ease. I can also imagine some other implementations
that would work well, but it doesn't seem correct that we could only give a single
opportunity to a device before forcing it to live at Gen1. Perhaps it should be
aware of the rate or a count or something...

I can only assume that there will be more systems that start to run into issues
with the patch as they strive to keep up with LTS & they exercise the hot-plug
path.
Oliver O'Halloran Aug. 8, 2024, 11:13 p.m. UTC | #14
On Thu, Aug 8, 2024 at 12:08 PM Matthew W Carlis <mattc@purestorage.com> wrote:
>
> On Wed, 7 Aug 2024 22:29:35 +1000 Oliver O'Halloran Wrote
> > My read was that Matt is essentially doing a surprise hot-unplug by
> > removing power to the card without notifying the OS. I thought the
> > LBMS bit wouldn't be set in that case since the link goes down rather
> > than changes speed, but the spec is a little vague and that appears to
> > be happening in Matt's testing. It might be worth disabling the
> > workaround if the port has the surprise hotplug capability bit set.
>
> Most of the systems I have are using downstream port containment which does
> not recommend setting the Hot-Plug Surprise in Slot Capabilities & therefore
> we do not. The first time we noticed an issue with this patch was in test
> automation which was power cycling the endpoints & injecting uncorrectable
> errors to ensure our hosts are robust in the face of PCIe chaos & that they
> will recover. Later we started to see other teams on other products
> encountering the same bug in simpler cases where humans turn on and off
> EP power for development purposes.

Ok? If we have to check for DPC being enabled in addition to checking
the surprise bit in the slot capabilities then that's fine, we can do
that. The question to be answered here is: how should this feature
work on ports where it's normal for a device to be removed without any
notice?
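
Checking for DPC would presumably be along the lines of (sketch only):

	/*
	 * Sketch: a port with DPC handles surprise link-down deliberately,
	 * so the clamp could arguably be skipped there as well.
	 */
	bool has_dpc = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_DPC);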
Maciej W. Rozycki Aug. 9, 2024, 1:34 p.m. UTC | #15
On Wed, 7 Aug 2024, Matthew W Carlis wrote:

> > For the quirk to trigger, the link has to be down and there has to be the
> > LBMS Link Status bit set from link management events as per the PCIe spec
> > while the link was previously up, and then both of that while rescanning
> > the PCIe device in question, so there's a lot of conditions to meet.
> 
> If there is nothing clearing the bit then why is there any expectation that
> it wouldn't be set? We have 20/30/40 endpoints in a host that can be hot-removed,
> hot-added at any point in time in any combination & its often the case that
> the system uptime be hundreds of days. Eventually the bit will just become set
> as a result of time and scale.

 Well, in principle in a setup with reliable links the LBMS bit may never 
be set, e.g. this system of mine has been in 24/7 operation since the last 
reboot 410 days ago and for the devices that support Link Active reporting 
it shows:

LnkSta:Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
LnkSta:Speed 5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+
LnkSta:Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta:Speed 5GT/s, Width x2, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta:Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta:Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
LnkSta:Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-

so out of 11 devices 6 have the LBMS bit clear.  But then 5 have it set, 
perhaps worryingly, so of course you're right, that it will get set in the 
field, though it's not enough by itself for your problem to trigger.

 Then there is manual link retraining, which we do from time to time as 
well, which will set the bit too, which I overlooked.

 To cut the story short, it was an oversight of mine, especially as I 
don't really make any use myself of hot plug scenarios.

> > The reason for this is safety: it's better to have a link run at 2.5GT/s
> > than not at all, and from the nature of the issue there is no guarantee
> > that if you remove the speed clamp, then the link won't go back into the
> > infinite retraining loop that the workaround is supposed to break.
> 
> I guess I don't really understand why it wouldn't be safe for every device
> other than the ASMedia since they aren't the device that had the issue in the
> first place. The main problem in my mind is the system doesn't actually have to
> be retraining at all, it can occur the first time you replace a device or
> power cycle the device or the first time the device goes into DPC & comes back.
> If the quirk simply returned without doing anything when the ASMedia is not in the
> allow-list it would make me more at ease. I can also imagine some other implementations
> that would work well, but it doesn't seem correct that we could only give a single
> opportunity to a device before forcing it to live at Gen1. Perhaps it should be
> aware of the rate or a count or something...

 It's a complex matter.  For a starter please read the introduction at 
`pcie_failed_link_retrain'.

 When the problem triggers the link goes into an infinite link training 
loop, with the Link Training (LT) bit flipping on and off.  I've made a 
complementary workaround for U-Boot (since your bootstrap device can be 
downstream such a broken link), where a statistical probe is done for the 
LT bit flipping as discovered by polling the bit in a tight loop.  This is 
fine for a piece of firmware that has nothing better to do, but in an OS 
kernel we just can't do it.
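
 In outline the probe does something like this (a sketch only, written 
with Linux-style accessors for illustration; the actual U-Boot code 
differs and the iteration count and threshold here are made up):

	/*
	 * Sketch: sample the LT bit in a tight loop and consider the link
	 * to be bouncing if the bit is seen to flip repeatedly.
	 */
	static bool lt_is_bouncing(struct pci_dev *bridge)
	{
		u16 lnksta, prev;
		int i, flips = 0;

		pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &prev);
		for (i = 0; i < 200; i++) {
			pcie_capability_read_word(bridge, PCI_EXP_LNKSTA,
						  &lnksta);
			if ((lnksta ^ prev) & PCI_EXP_LNKSTA_LT)
				flips++;
			prev = lnksta;
		}
		return flips >= 2;
	}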

 Also U-Boot does not remove the 2.5GT/s clamp because it does not have to 
set up the system in an optimal way, but just sufficiently to successfully 
boot.  An OS kernel can then adjust the configuration if it knows about 
the workaround, or leave it as it is.  In the latter case things will 
continue working across PCIe resets, because the TLS field is sticky.

 For Linux I went for just checking the DLLLA bit as it seems good enough 
for devices that support Link Active reporting.  And I admit that the 
workaround is quite aggressive, but this was a deliberate decision 
following the robustness principle: the end user may not be qualified 
enough to diagnose the problem and given its nature not even think it's 
something that can be made to work.  The downstream device does not show 
up in the PCIe hierarchy and just looks like it's broken.  It takes a lot 
of knowledge and experience to notice the LT bit flipping and even myself, 
quite a seasoned Linux kernel developer, only noticed it by chance.

 It was discussed to death with the original submission, the rationale is 
given in the introductory comment and I'd prefer that it stays as it is.  
The ASM2824 switch works just fine with other downstream devices and so 
does the PI7C9X2G304 switch with other upstream devices.  Both vendors 
refused to help narrow it down and declined to comment (a person from 
Diodes née Pericom replied, but when they learnt it was about a PCIe switch they 
told me they were from another department and couldn't help), which would 
have helped (I now just recommend staying away from both manufacturers).  
Therefore my assumption is this can potentially trigger with any pair of 
devices.

 I have now posted a patch series to address this problem of yours and a 
previous issue reported by Ilpo.  See: 
<https://patchwork.kernel.org/project/linux-pci/list/?series=878216>, and 
I've included you in the list of direct recipients too.

 I will appreciate it if you give it a try, and again, apologies for the 
trouble caused, but such things happen in engineering.

  Maciej
Matthew W Carlis Aug. 15, 2024, 7:40 p.m. UTC | #16
Sorry for the delay in my responses here; I had some things get in my way.

On Fri, 9 Aug 2024 09:13:52 Oliver O'Halloran <oohall@gmail.com> wrote:

> Ok? If we have to check for DPC being enabled in addition to checking
> the surprise bit in the slot capabilities then that's fine, we can do
> that. The question to be answered here is: how should this feature
> work on ports where it's normal for a device to be removed without any
> notice?

I'm not sure if it's the correct thing to check, however. I assumed that ports
using the pciehp driver would usually consider it "normal" for a device to
be removed actually, but maybe I have the idea of hp reversed.

On Fri, 9 Aug 2024 14:34:04 Maciej W. Rozycki <macro@orcam.me.uk> wrote:

> Well, in principle in a setup with reliable links the LBMS bit may never 
> be set, e.g. this system of mine has been in 24/7 operation since the last 
> reboot 410 days ago and for the devices that support Link Active reporting 
> it shows:
> ...
> so out of 11 devices 6 have the LBMS bit clear.  But then 5 have it set, 
> perhaps worryingly, so of course you're right, that it will get set in the 
> field, though it's not enough by itself for your problem to trigger.

The way I look at it is that it's essentially a probability distribution with time,
but I try to avoid learning too much about the physical layer because I would find
myself debugging more hardware issues lol. I also don't think LBMS/LABS being set
by itself is very interesting without knowing the rate at which it is being set.
FWIW I have seen some devices in the past going into recovery state many times a
second & still never downtrain, but at the same time they were setting the
LBMS/LABS bits, which is maybe not quite spec compliant.

I would like to help test these changes, but I would like to avoid having to test
each mentioned change individually. Does anyone have any preferences in how I batch
the patches for testing? Would it be ok if I just pulled them all together in one go?

- Matt
Maciej W. Rozycki Aug. 16, 2024, 1:57 p.m. UTC | #17
On Thu, 15 Aug 2024, Matthew W Carlis wrote:

> > Well, in principle in a setup with reliable links the LBMS bit may never 
> > be set, e.g. this system of mine has been in 24/7 operation since the last 
> > reboot 410 days ago and for the devices that support Link Active reporting 
> > it shows:
> > ...
> > so out of 11 devices 6 have the LBMS bit clear.  But then 5 have it set, 
> > perhaps worryingly, so of course you're right, that it will get set in the 
> > field, though it's not enough by itself for your problem to trigger.
> 
> The way I look at it is that it's essentially a probability distribution with time,
> but I try to avoid learning too much about the physical layer because I would find
> myself debugging more hardware issues lol. I also don't think LBMS/LABS being set
> by itself is very interesting without knowing the rate at which it is being set.

 Agreed.  Ilpo's upcoming bandwidth controller will hopefully give us such 
data.

> FWIW I have seen some devices in the past going into recovery state many times a
> second & still never downtrain, but at the same time they were setting the
> LBMS/LABS bits, which is maybe not quite spec compliant.
> 
> I would like to help test these changes, but I would like to avoid having to test
> each mentioned change individually. Does anyone have any preferences in how I batch
> the patches for testing? Would it be ok if I just pulled them all together in one go?

 Certainly fine with me, especially as 3/4 and 4/4 aren't really related 
to your failure scenario, and then you need 1/4 and 2/4 both at a time to 
address both aspects of the issue you have reported.

  Maciej
Matthew W Carlis Oct. 1, 2024, 9:04 p.m. UTC | #18
I just wanted to follow up with our testing results for the mentioned
patches. It took me a while to get them running in our test pool, but
we just got it going yesterday and the initial results look really good.
We will continue running them in our testing from now on & if any issues
come up I'll try to report them, but otherwise I wanted to say thank you
for entertaining the discussion & reacting so quickly with new patches.

The patches we pulled into our testing:

[PATCH v3 1/4] PCI: Clear the LBMS bit after a link retrain
[PATCH v3 2/4] PCI: Revert to the original speed after PCIe failed link retraining

- Matt
Maciej W. Rozycki Oct. 2, 2024, 12:58 p.m. UTC | #19
On Tue, 1 Oct 2024, Matthew W Carlis wrote:

> I just wanted to follow up with our testing results for the mentioned
> patches. It took me a while to get them running in our test pool, but
> we just got it going yesterday and the initial results look really good.
> We will continue running them in our testing from now on & if any issues
> come up I'll try to report them, but otherwise I wanted to say thank you
> for entertaining the discussion & reacting so quickly with new patches.

 My pleasure.  I'm glad that the solution works for you.  Let me know if 
you need anything else.

  Maciej
Bjorn Helgaas Oct. 2, 2024, 8:55 p.m. UTC | #20
On Wed, Oct 02, 2024 at 01:58:15PM +0100, Maciej W. Rozycki wrote:
> On Tue, 1 Oct 2024, Matthew W Carlis wrote:
> 
> > I just wanted to follow up with our testing results for the mentioned
> > patches. It took me a while to get them running in our test pool, but
> > we just got it going yesterday and the initial results look really good.
> > We will continue running them in our testing from now on & if any issues
> > come up I'll try to report them, but otherwise I wanted to say thank you
> > for entertaining the discussion & reacting so quickly with new patches.
> 
>  My pleasure.  I'm glad that the solution works for you.  Let me know if 
> you need anything else.

If there's anything missing that still needs to be added to v6.13-rc1,
can somebody repost those?  I lost track of what's still outstanding.

Bjorn
Maciej W. Rozycki Oct. 3, 2024, 10:39 a.m. UTC | #21
On Wed, 2 Oct 2024, Bjorn Helgaas wrote:

> If there's anything missing that still needs to be added to v6.13-rc1,
> can somebody repost those?  I lost track of what's still outstanding.

 I have nothing outstanding to add right away.  Thank you for asking.

  Maciej