[v3] capi: Poll Err/Status register during CAPP recovery

Message ID	20180319041535.27251-1-vaibhav@linux.vnet.ibm.com
State	Superseded
Headers	show Return-Path: <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <skiboot@lists.ozlabs.org> from <vaibhav@linux.vnet.ibm.com>; Mon, 19 Mar 2018 04:16:05 -0000 Gateway: Authorized Use Only! Violators will be prosecuted; Mon, 19 Mar 2018 04:16:02 -0000 From: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> To: Christophe Lombard <christophe_lombard@fr.ibm.com>, Frederic Barrat <fbarrat@linux.vnet.ibm.com> Date: Mon, 19 Mar 2018 09:45:35 +0530 Message-Id: <20180319041535.27251-1-vaibhav@linux.vnet.ibm.com> Subject: [Skiboot] [PATCH v3] capi: Poll Err/Status register during CAPP recovery Precedence: list Cc: Philippe Bergheaud <philippe.bergheaud@fr.ibm.com>, Andrew Donnellan <andrew.donnellan@au1.ibm.com>, skiboot@lists.ozlabs.org MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>
Series	[v3] capi: Poll Err/Status register during CAPP recovery \| expand [v3] capi: Poll Err/Status register during CAPP recovery

Message ID

20180319041535.27251-1-vaibhav@linux.vnet.ibm.com

State

Superseded

Headers

From: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
To: Christophe Lombard <christophe_lombard@fr.ibm.com>,
	Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Date: Mon, 19 Mar 2018 09:45:35 +0530
Message-Id: <20180319041535.27251-1-vaibhav@linux.vnet.ibm.com>
Subject: [Skiboot] [PATCH v3] capi: Poll Err/Status register during CAPP
	recovery
Precedence: list
Cc: Philippe Bergheaud <philippe.bergheaud@fr.ibm.com>,
	Andrew Donnellan <andrew.donnellan@au1.ibm.com>, skiboot@lists.ozlabs.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"
	<skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Series

[v3] capi: Poll Err/Status register during CAPP recovery | expand

Commit Message

Vaibhav Jain March 19, 2018, 4:15 a.m. UTC

This patch updates do_capp_recovery_scoms() to poll the CAPP
Err/Status control register, check for CAPP-Recovery to complete/fail
based on indications of BITS-1,5,9 and then proceed with the
CAPP-Recovery scoms iif recovery completed successfully. This would
prevent cases where we bring-up the PCIe link while recovery sequencer
on CAPP is still busy with casting out cache lines.

In case CAPP-Recovery didn't complete successfully an error is returned
from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4
fenced and mark it as broken.

The loop that implements polling of Err/Status register will also log
an error on the PHB when it continues for more than 168ms which is the
max time to failure for CAPP-Recovery.

Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Changelog:
v3 -> Introduced a timeout for the Poll loop of 168ms [Christophe]

v2 -> Added an extra check for Bit(0) in Err/Status reg at the
      beginning to check if recovery mode was entered. [Christophe]
---
 hw/phb4.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 68 insertions(+), 17 deletions(-)

Comments

Alastair D'Silva March 19, 2018, 4:24 a.m. UTC | #1

On Mon, 2018-03-19 at 09:45 +0530, Vaibhav Jain wrote:
> This patch updates do_capp_recovery_scoms() to poll the CAPP
> Err/Status control register, check for CAPP-Recovery to complete/fail
> based on indications of BITS-1,5,9 and then proceed with the
> CAPP-Recovery scoms iif recovery completed successfully. This would
> prevent cases where we bring-up the PCIe link while recovery
> sequencer
> on CAPP is still busy with casting out cache lines.
> 
> In case CAPP-Recovery didn't complete successfully an error is
> returned
> from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4
> fenced and mark it as broken.
> 
> The loop that implements polling of Err/Status register will also log
> an error on the PHB when it continues for more than 168ms which is
> the
> max time to failure for CAPP-Recovery.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
> Changelog:
> v3 -> Introduced a timeout for the Poll loop of 168ms [Christophe]
> 
> v2 -> Added an extra check for Bit(0) in Err/Status reg at the
>       beginning to check if recovery mode was entered. [Christophe]
> ---
>  hw/phb4.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> ----------
>  1 file changed, 68 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index 47175df2..30f46f9a 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -2857,25 +2857,75 @@ static int64_t load_capp_ucode(struct phb4
> *p)
>  	return rc;
>  }
> 
> -static void do_capp_recovery_scoms(struct phb4 *p)
> +static int do_capp_recovery_scoms(struct phb4 *p)
>  {
> -	uint64_t reg;
> -	uint32_t offset;
> +	uint64_t rc, reg, end;
> +	uint64_t offset = PHB4_CAPP_REG_OFFSET(p);
> 
> -	PHBDBG(p, "Doing CAPP recovery scoms\n");
> -
> -	offset = PHB4_CAPP_REG_OFFSET(p);
> -	/* disable snoops */
> -	xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0);
> -	load_capp_ucode(p);
> -	/* clear err rpt reg*/
> -	xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0);
> -	/* clear capp fir */
> -	xscom_write(p->chip_id, CAPP_FIR + offset, 0);
> 
> +	/* Get the status of CAPP recovery */
>  	xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
> -	reg &= ~(PPC_BIT(0) | PPC_BIT(1));
> -	xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg);
> +
> +	/* No recovery in progress ignore */
> +	if ((reg & PPC_BIT(0)) == 0) {
> +		PHBDBG(p, "CAPP: No recovery in progress\n");
> +		return 0;
Did you mean OPAL_SUCCESS here? Other return points in the function use
the OPAL_* macros.

> +	}
> +
> +	PHBDBG(p, "CAPP: Waiting for recovery to complete\n");
> +	/* recovery timer failure period 168ms */
> +	end = mftb() + msecs_to_tb(168);
> +	while ((reg & (PPC_BIT(1) | PPC_BIT(5) | PPC_BIT(9))) == 0)
> {
> +
> +		time_wait_ms(5);
> +		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL +
> offset, &reg);
> +
> +		if (end && tb_compare(mftb(), end) != TB_AAFTERB) {
> +			PHBERR(p, "CAPP: Capp recovery Timed-
> out.\n");
> +			end = 0;
> +			break;
> +		}
> +	}
> +
> +	/* Check if the recovery failed or passed */
> +	if (reg & PPC_BIT(1)) {
> +		PHBDBG(p, "Doing CAPP recovery scoms\n");
> +		/* disable snoops */
> +		xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset,
> 0);
> +		load_capp_ucode(p);
> +
> +		/* clear err rpt reg*/
> +		xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset,
> 0);
> +
> +		/* clear capp fir */
> +		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
> +
> +		/* Just reset Bit-0,1 and dont touch any other bit
> */
> +		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL +
> offset, &reg);
> +		reg &= ~(PPC_BIT(0) | PPC_BIT(1));
> +		xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL +
> offset, reg);
> +
> +		PHBDBG(p, "CAPP recovery complete\n");
> +		rc = OPAL_SUCCESS;
> +
> +	} else {
> +		/* Most likely will checkstop here due to FIR ACTION
> for
> +		 * failed recovery. So this message would never be
> logged.
> +		 * But if we still enter here then return an error
> forcing a
> +		 * fence of the PHB.
> +		 */
> +		if (reg  & PPC_BIT(5))
> +			PHBERR(p, "CAPP: Capp recovery Failed\n");
> +		else if (reg  & PPC_BIT(9))
> +			PHBERR(p, "CAPP: Capp recovery hang
> detected\n");
> +		else if (end != 0)
> +			PHBERR(p, "CAPP: Unknown recovery
> failure\n");
> +
> +		PHBDBG(p, "CAPP: Err/Status-reg=0x%016llx\n", reg);
> +		rc = OPAL_HARDWARE;
> +	}
> +
> +	return rc;
>  }
> 
>  static int64_t phb4_creset(struct pci_slot *slot)
> @@ -2934,8 +2984,9 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
>  			PHBDBG(p, "CRESET: No pending
> transactions\n");
> 
>  			/* capp recovery */
> -			if (p->flags & PHB4_CAPP_RECOVERY)
> -				do_capp_recovery_scoms(p);
> +			if (p->flags & PHB4_CAPP_RECOVERY &&
> +			    do_capp_recovery_scoms(p))
> +				goto error;
> 
>  			/* Clear errors in PFIR and NFIR */
>  			xscom_write(p->chip_id, p->pci_stk_xscom +
> 0x1,
>

Alastair D'Silva March 19, 2018, 4:30 a.m. UTC | #2

This patch updates do_capp_recovery_scoms() to poll the CAPP
> Err/Status control register, check for CAPP-Recovery to complete/fail
> based on indications of BITS-1,5,9 and then proceed with the
> CAPP-Recovery scoms iif recovery completed successfully. This would
> prevent cases where we bring-up the PCIe link while recovery
> sequencer
> on CAPP is still busy with casting out cache lines.
> 
> In case CAPP-Recovery didn't complete successfully an error is
> returned
> from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4
> fenced and mark it as broken.
> 
> The loop that implements polling of Err/Status register will also log
> an error on the PHB when it continues for more than 168ms which is
> the
> max time to failure for CAPP-Recovery.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
> Changelog:
> v3 -> Introduced a timeout for the Poll loop of 168ms [Christophe]
> 
> v2 -> Added an extra check for Bit(0) in Err/Status reg at the
>       beginning to check if recovery mode was entered. [Christophe]
> ---
>  hw/phb4.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> ----------
>  1 file changed, 68 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index 47175df2..30f46f9a 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -2857,25 +2857,75 @@ static int64_t load_capp_ucode(struct phb4
> *p)
>  	return rc;
>  }
> 
> -static void do_capp_recovery_scoms(struct phb4 *p)
> +static int do_capp_recovery_scoms(struct phb4 *p)
>  {
> -	uint64_t reg;
> -	uint32_t offset;
> +	uint64_t rc, reg, end;
> +	uint64_t offset = PHB4_CAPP_REG_OFFSET(p);
> 
> -	PHBDBG(p, "Doing CAPP recovery scoms\n");
> -
> -	offset = PHB4_CAPP_REG_OFFSET(p);
> -	/* disable snoops */
> -	xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0);
> -	load_capp_ucode(p);
> -	/* clear err rpt reg*/
> -	xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0);
> -	/* clear capp fir */
> -	xscom_write(p->chip_id, CAPP_FIR + offset, 0);
> 
> +	/* Get the status of CAPP recovery */
>  	xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
> -	reg &= ~(PPC_BIT(0) | PPC_BIT(1));
> -	xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg);
> +
> +	/* No recovery in progress ignore */
> +	if ((reg & PPC_BIT(0)) == 0) {
> +		PHBDBG(p, "CAPP: No recovery in progress\n");
> +		return 0;
> +	}
> +
> +	PHBDBG(p, "CAPP: Waiting for recovery to complete\n");
> +	/* recovery timer failure period 168ms */
> +	end = mftb() + msecs_to_tb(168);
> +	while ((reg & (PPC_BIT(1) | PPC_BIT(5) | PPC_BIT(9))) == 0)
> {
> +
> +		time_wait_ms(5);
> +		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL +
> offset, &reg);
> +
> +		if (end && tb_compare(mftb(), end) != TB_AAFTERB) {
> +			PHBERR(p, "CAPP: Capp recovery Timed-
> out.\n");
> +			end = 0;
> +			break;
> +		}
> +	}
> +
> +	/* Check if the recovery failed or passed */
> +	if (reg & PPC_BIT(1)) {
> +		PHBDBG(p, "Doing CAPP recovery scoms\n");
> +		/* disable snoops */
> +		xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset,
> 0);
> +		load_capp_ucode(p);
> +
> +		/* clear err rpt reg*/
> +		xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset,
> 0);
> +
> +		/* clear capp fir */
> +		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
> +
> +		/* Just reset Bit-0,1 and dont touch any other bit
> */
> +		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL +
> offset, &reg);
> +		reg &= ~(PPC_BIT(0) | PPC_BIT(1));
> +		xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL +
> offset, reg);
> +
> +		PHBDBG(p, "CAPP recovery complete\n");
> +		rc = OPAL_SUCCESS;
> +
> +	} else {
> +		/* Most likely will checkstop here due to FIR ACTION
> for
> +		 * failed recovery. So this message would never be
> logged.
> +		 * But if we still enter here then return an error
> forcing a
> +		 * fence of the PHB.
> +		 */
> +		if (reg  & PPC_BIT(5))
> +			PHBERR(p, "CAPP: Capp recovery Failed\n");
> +		else if (reg  & PPC_BIT(9))
> +			PHBERR(p, "CAPP: Capp recovery hang
> detected\n");
> +		else if (end != 0)
> +			PHBERR(p, "CAPP: Unknown recovery
> failure\n");
> +
> +		PHBDBG(p, "CAPP: Err/Status-reg=0x%016llx\n", reg);
> +		rc = OPAL_HARDWARE;
> +	}
> +
> +	return rc;
>  }
> 
>  static int64_t phb4_creset(struct pci_slot *slot)
> @@ -2934,8 +2984,9 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
>  			PHBDBG(p, "CRESET: No pending
> transactions\n");
> 
>  			/* capp recovery */
> -			if (p->flags & PHB4_CAPP_RECOVERY)
> -				do_capp_recovery_scoms(p);
> +			if (p->flags & PHB4_CAPP_RECOVERY &&
> +			    do_capp_recovery_scoms(p))

Sorry, I missed a comment here on the last mail. Elsewhere in the
skiboot we explicitly check for ==/!= OPAL_SUCCESS, rather than
assuming the value evaluates to false.

> +				goto error;
> 
>  			/* Clear errors in PFIR and NFIR */
>  			xscom_write(p->chip_id, p->pci_stk_xscom +
> 0x1,
>

Vaibhav Jain March 19, 2018, 4:32 a.m. UTC | #3

Hey Alastair,

Thanks for looking into this patch:

Alastair D'Silva <alastair@au1.ibm.com> writes:

>> +	/* No recovery in progress ignore */
>> +	if ((reg & PPC_BIT(0)) == 0) {
>> +		PHBDBG(p, "CAPP: No recovery in progress\n");
>> +		return 0;
> Did you mean OPAL_SUCCESS here? Other return points in the function use
> the OPAL_* macros.
Good catch. Will include this change in next revision.

diff --git a/hw/phb4.c b/hw/phb4.c
index 47175df2..30f46f9a 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -2857,25 +2857,75 @@  static int64_t load_capp_ucode(struct phb4 *p)
 	return rc;
 }
 
-static void do_capp_recovery_scoms(struct phb4 *p)
+static int do_capp_recovery_scoms(struct phb4 *p)
 {
-	uint64_t reg;
-	uint32_t offset;
+	uint64_t rc, reg, end;
+	uint64_t offset = PHB4_CAPP_REG_OFFSET(p);
 
-	PHBDBG(p, "Doing CAPP recovery scoms\n");
-
-	offset = PHB4_CAPP_REG_OFFSET(p);
-	/* disable snoops */
-	xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0);
-	load_capp_ucode(p);
-	/* clear err rpt reg*/
-	xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0);
-	/* clear capp fir */
-	xscom_write(p->chip_id, CAPP_FIR + offset, 0);
 
+	/* Get the status of CAPP recovery */
 	xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
-	reg &= ~(PPC_BIT(0) | PPC_BIT(1));
-	xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg);
+
+	/* No recovery in progress ignore */
+	if ((reg & PPC_BIT(0)) == 0) {
+		PHBDBG(p, "CAPP: No recovery in progress\n");
+		return 0;
+	}
+
+	PHBDBG(p, "CAPP: Waiting for recovery to complete\n");
+	/* recovery timer failure period 168ms */
+	end = mftb() + msecs_to_tb(168);
+	while ((reg & (PPC_BIT(1) | PPC_BIT(5) | PPC_BIT(9))) == 0) {
+
+		time_wait_ms(5);
+		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
+
+		if (end && tb_compare(mftb(), end) != TB_AAFTERB) {
+			PHBERR(p, "CAPP: Capp recovery Timed-out.\n");
+			end = 0;
+			break;
+		}
+	}
+
+	/* Check if the recovery failed or passed */
+	if (reg & PPC_BIT(1)) {
+		PHBDBG(p, "Doing CAPP recovery scoms\n");
+		/* disable snoops */
+		xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0);
+		load_capp_ucode(p);
+
+		/* clear err rpt reg*/
+		xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0);
+
+		/* clear capp fir */
+		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
+
+		/* Just reset Bit-0,1 and dont touch any other bit */
+		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
+		reg &= ~(PPC_BIT(0) | PPC_BIT(1));
+		xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg);
+
+		PHBDBG(p, "CAPP recovery complete\n");
+		rc = OPAL_SUCCESS;
+
+	} else {
+		/* Most likely will checkstop here due to FIR ACTION for
+		 * failed recovery. So this message would never be logged.
+		 * But if we still enter here then return an error forcing a
+		 * fence of the PHB.
+		 */
+		if (reg  & PPC_BIT(5))
+			PHBERR(p, "CAPP: Capp recovery Failed\n");
+		else if (reg  & PPC_BIT(9))
+			PHBERR(p, "CAPP: Capp recovery hang detected\n");
+		else if (end != 0)
+			PHBERR(p, "CAPP: Unknown recovery failure\n");
+
+		PHBDBG(p, "CAPP: Err/Status-reg=0x%016llx\n", reg);
+		rc = OPAL_HARDWARE;
+	}
+
+	return rc;
 }
 
 static int64_t phb4_creset(struct pci_slot *slot)
@@ -2934,8 +2984,9 @@  static int64_t phb4_creset(struct pci_slot *slot)
 			PHBDBG(p, "CRESET: No pending transactions\n");
 
 			/* capp recovery */
-			if (p->flags & PHB4_CAPP_RECOVERY)
-				do_capp_recovery_scoms(p);
+			if (p->flags & PHB4_CAPP_RECOVERY &&
+			    do_capp_recovery_scoms(p))
+				goto error;
 
 			/* Clear errors in PFIR and NFIR */
 			xscom_write(p->chip_id, p->pci_stk_xscom + 0x1,

[v3] capi: Poll Err/Status register during CAPP recovery

Commit Message

Comments

Patch