diff mbox series

[RFC] PCI: Fix the issue of link speed downgrade after link retraining

Message ID 20241107143758.12643-1-guojinhui.liam@bytedance.com
State New
Headers show
Series [RFC] PCI: Fix the issue of link speed downgrade after link retraining | expand

Commit Message

Jinhui Guo Nov. 7, 2024, 2:37 p.m. UTC
The link speed is downgraded to 2.5 GT/s when a Samsung NVMe device
is hotplugged into a Intel PCIe root port [8086:0db0].

```
+-[0000:3c]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           ...
|           +02.0-[3d]----00.0  Samsung Electronics Co Ltd Device a80e
```

Some printing information can be obtained when the issue emerges.
"Card present" is reported twice via external interrupts due to
a slight tremor when the Samsung NVMe device is plugged in.
The failure of the link activation for the first time leads to
the link speed of the root port being mistakenly downgraded to 2.5G/s.

```
[ 8223.419682] pcieport 0000:3d:02.0: pciehp: Slot(1): Card present
[ 8224.449714] pcieport 0000:3d:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 8225.518723] pcieport 0000:3d:02.0: pciehp: Slot(1): Card present
[ 8225.518726] pcieport 0000:3d:02.0: pciehp: Slot(1): Link up
```

To avoid wrongly setting the link speed to 2.5GT/s, only allow
specific pcie devices to perform link retrain.

Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/pci/quirks.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Bjorn Helgaas Nov. 7, 2024, 3:34 p.m. UTC | #1
[+cc Lukas, -cc stable]

On Thu, Nov 07, 2024 at 10:37:58PM +0800, Jinhui Guo wrote:
> The link speed is downgraded to 2.5 GT/s when a Samsung NVMe device
> is hotplugged into a Intel PCIe root port [8086:0db0].
> 
> ```
> +-[0000:3c]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
> |           ...
> |           +02.0-[3d]----00.0  Samsung Electronics Co Ltd Device a80e
> ```
> 
> Some printing information can be obtained when the issue emerges.
> "Card present" is reported twice via external interrupts due to
> a slight tremor when the Samsung NVMe device is plugged in.
> The failure of the link activation for the first time leads to
> the link speed of the root port being mistakenly downgraded to 2.5G/s.
> 
> ```
> [ 8223.419682] pcieport 0000:3d:02.0: pciehp: Slot(1): Card present
> [ 8224.449714] pcieport 0000:3d:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
> [ 8225.518723] pcieport 0000:3d:02.0: pciehp: Slot(1): Card present
> [ 8225.518726] pcieport 0000:3d:02.0: pciehp: Slot(1): Link up
> ```

No need for markdown in commit logs.  Drop timestamps (I don't think
they are telling us anything useful here).  Indent quoted material a
couple spaces.

Nothing here looks specific to the Intel Root Port or the NVMe device;
I assume any hot-added device could see the same problem, at least
with pciehp.

> To avoid wrongly setting the link speed to 2.5GT/s, only allow
> specific pcie devices to perform link retrain.

s/pcie/PCIe/

> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> ---
>  drivers/pci/quirks.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index dccb60c1d9cc..59858156003b 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -91,7 +91,8 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
>  	int ret = -ENOTTY;
>  
>  	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> -	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> +	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting ||
> +		!pci_match_id(ids, dev))
>  		return ret;
>  
>  	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> @@ -119,8 +120,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
>  	}
>  
>  	if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> -	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> -	    pci_match_id(ids, dev)) {
> +	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT) {
>  		u32 lnkcap;
>  
>  		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> -- 
> 2.20.1
>
Lukas Wunner Nov. 8, 2024, 11:11 a.m. UTC | #2
[+cc Maciej, Ilpo]

On Thu, Nov 07, 2024 at 09:34:38AM -0600, Bjorn Helgaas wrote:
> On Thu, Nov 07, 2024 at 10:37:58PM +0800, Jinhui Guo wrote:
> > The link speed is downgraded to 2.5 GT/s when a Samsung NVMe device
> > is hotplugged into a Intel PCIe root port [8086:0db0].
> > 
> > ```
> > +-[0000:3c]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
> > |           ...
> > |           +02.0-[3d]----00.0  Samsung Electronics Co Ltd Device a80e
> > ```
> > 
> > Some printing information can be obtained when the issue emerges.
> > "Card present" is reported twice via external interrupts due to
> > a slight tremor when the Samsung NVMe device is plugged in.
> > The failure of the link activation for the first time leads to
> > the link speed of the root port being mistakenly downgraded to 2.5G/s.
> > 
> > ```
> > [ 8223.419682] pcieport 0000:3d:02.0: pciehp: Slot(1): Card present
> > [ 8224.449714] pcieport 0000:3d:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
> > [ 8225.518723] pcieport 0000:3d:02.0: pciehp: Slot(1): Card present
> > [ 8225.518726] pcieport 0000:3d:02.0: pciehp: Slot(1): Link up
> > ```
> > 
> > To avoid wrongly setting the link speed to 2.5GT/s, only allow
> > specific pcie devices to perform link retrain.

With which kernel version are you seeing this?

A set of fixes for the 2.5GT/s retraining feature appeared in v6.12-rc1,
specifically f68dea13405c ("PCI: Revert to the original speed after PCIe
failed link retraining").

Have you tested whether the latest v6.12 rc is still affected?

Thanks,

Lukas

> > Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> > Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> > ---
> >  drivers/pci/quirks.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index dccb60c1d9cc..59858156003b 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -91,7 +91,8 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
> >  	int ret = -ENOTTY;
> >  
> >  	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> > -	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> > +	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting ||
> > +		!pci_match_id(ids, dev))
> >  		return ret;
> >  
> >  	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> > @@ -119,8 +120,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
> >  	}
> >  
> >  	if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> > -	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> > -	    pci_match_id(ids, dev)) {
> > +	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT) {
> >  		u32 lnkcap;
> >  
> >  		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> > -- 
> > 2.20.1
Maciej W. Rozycki Nov. 11, 2024, 12:50 p.m. UTC | #3
On Fri, 8 Nov 2024, Lukas Wunner wrote:

> > > Some printing information can be obtained when the issue emerges.
> > > "Card present" is reported twice via external interrupts due to
> > > a slight tremor when the Samsung NVMe device is plugged in.

 What do you mean by "a slight tremor"?  Do the devices involved fail to 
negotiate link in the time prescribed by the PCIe specification?  Why is 
the interrupt sent twice?

> > > To avoid wrongly setting the link speed to 2.5GT/s, only allow
> > > specific pcie devices to perform link retrain.
> 
> With which kernel version are you seeing this?
> 
> A set of fixes for the 2.5GT/s retraining feature appeared in v6.12-rc1,
> specifically f68dea13405c ("PCI: Revert to the original speed after PCIe
> failed link retraining").
> 
> Have you tested whether the latest v6.12 rc is still affected?

 Thanks for chiming in.  I do hope the fixes will have addressed issues 
like one concerned here.

 NB I have been fully booked recently due to an upcoming patch submission 
deadline and will continue to be throughout this week, so I may be slow to 
response.

  Maciej
diff mbox series

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index dccb60c1d9cc..59858156003b 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -91,7 +91,8 @@  int pcie_failed_link_retrain(struct pci_dev *dev)
 	int ret = -ENOTTY;
 
 	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
-	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
+	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting ||
+		!pci_match_id(ids, dev))
 		return ret;
 
 	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
@@ -119,8 +120,7 @@  int pcie_failed_link_retrain(struct pci_dev *dev)
 	}
 
 	if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
-	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
-	    pci_match_id(ids, dev)) {
+	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT) {
 		u32 lnkcap;
 
 		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");