diff mbox series

[RFC] phb4: Wait for PRD to reset the CAPP Fir during recovery

Message ID 20181016113240.13993-1-vaibhav@linux.ibm.com
State RFC
Headers show
Series [RFC] phb4: Wait for PRD to reset the CAPP Fir during recovery | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success master/apply_patch Successfully applied
snowpatch_ozlabs/make_check success Test make_check on branch master

Commit Message

Vaibhav Jain Oct. 16, 2018, 11:32 a.m. UTC
During CAPP recovery do_capp_recovery_scoms() will reset the CAPP Fir
register just after CAPP recovery is completed. This has an
unintentional side effect of preventing PRD from analyzing and
reporting this error. If PRD tries to read the CAPP FIR after opal has
already reset it, then it logs a critical error complaining "No active
error bits found".

To prevent this from happening we update do_capp_recovery_scoms() to
wait for CAPP Fir to be reset by PRD just after CAPP recovery
completes and before we proceed with rest of the CAPP recovery
sequence. A timeout of 5ms is used to wait for CAPP-Fir reset before
we reset the register on our own. This is to guard against the
possibility of Opal PRD daemon crashing/not-running.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 hw/phb4.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

Comments

Frederic Barrat Oct. 16, 2018, 3:43 p.m. UTC | #1
Le 16/10/2018 à 13:32, Vaibhav Jain a écrit :
> During CAPP recovery do_capp_recovery_scoms() will reset the CAPP Fir
> register just after CAPP recovery is completed. This has an
> unintentional side effect of preventing PRD from analyzing and
> reporting this error. If PRD tries to read the CAPP FIR after opal has
> already reset it, then it logs a critical error complaining "No active
> error bits found".
> 
> To prevent this from happening we update do_capp_recovery_scoms() to
> wait for CAPP Fir to be reset by PRD just after CAPP recovery
> completes and before we proceed with rest of the CAPP recovery
> sequence. A timeout of 5ms is used to wait for CAPP-Fir reset before
> we reset the register on our own. This is to guard against the
> possibility of Opal PRD daemon crashing/not-running.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---

This looks really odd to me. I think we need to understand why the PRD 
is messing with the CAPP FIR.

   Fred


>   hw/phb4.c | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index 10df206b..bb95a953 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3055,7 +3055,24 @@ static int do_capp_recovery_scoms(struct phb4 *p)
>   
>   	/* Check if the recovery failed or passed */
>   	if (reg & PPC_BIT(1)) {
> +
> +		PHBDBG(p, "Waiting for FIR to reset\n");
> +
> +		/* Wait for FIR to be reset by PRD */
> +		end = mftb() + msecs_to_tb(5);
> +		do {
> +			xscom_read(p->chip_id, CAPP_FIR + offset, &reg);
> +			time_wait_us(100);
> +			if (tb_compare(mftb(), end) != TB_ABEFOREB) {
> +				PHBERR(p, "CAPP: Capp FIR reset Timed-out.\n");
> +				break;
> +			}
> +		} while (reg);
> +
>   		PHBDBG(p, "Doing CAPP recovery scoms\n");
> +		/* clear capp fir */
> +		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
> +
>   		/* disable snoops */
>   		xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0);
>   		load_capp_ucode(p);
> @@ -3063,9 +3080,6 @@ static int do_capp_recovery_scoms(struct phb4 *p)
>   		/* clear err rpt reg*/
>   		xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0);
>   
> -		/* clear capp fir */
> -		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
> -
>   		/* Just reset Bit-0,1 and dont touch any other bit */
>   		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
>   		reg &= ~(PPC_BIT(0) | PPC_BIT(1));
>
Andrew Donnellan Oct. 17, 2018, 12:12 a.m. UTC | #2
On 17/10/18 2:43 am, Frederic Barrat wrote:
> 
> 
> Le 16/10/2018 à 13:32, Vaibhav Jain a écrit :
>> During CAPP recovery do_capp_recovery_scoms() will reset the CAPP Fir
>> register just after CAPP recovery is completed. This has an
>> unintentional side effect of preventing PRD from analyzing and
>> reporting this error. If PRD tries to read the CAPP FIR after opal has
>> already reset it, then it logs a critical error complaining "No active
>> error bits found".
>>
>> To prevent this from happening we update do_capp_recovery_scoms() to
>> wait for CAPP Fir to be reset by PRD just after CAPP recovery
>> completes and before we proceed with rest of the CAPP recovery
>> sequence. A timeout of 5ms is used to wait for CAPP-Fir reset before
>> we reset the register on our own. This is to guard against the
>> possibility of Opal PRD daemon crashing/not-running.
>>
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>> ---
> 
> This looks really odd to me. I think we need to understand why the PRD 
> is messing with the CAPP FIR.

Spinning in skiboot for 5ms is also not great.

How does PRD cope with FIRs in other parts of the system which get 
cleared as part of recovery? (I'm thinking PHBs etc)
diff mbox series

Patch

diff --git a/hw/phb4.c b/hw/phb4.c
index 10df206b..bb95a953 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3055,7 +3055,24 @@  static int do_capp_recovery_scoms(struct phb4 *p)
 
 	/* Check if the recovery failed or passed */
 	if (reg & PPC_BIT(1)) {
+
+		PHBDBG(p, "Waiting for FIR to reset\n");
+
+		/* Wait for FIR to be reset by PRD */
+		end = mftb() + msecs_to_tb(5);
+		do {
+			xscom_read(p->chip_id, CAPP_FIR + offset, &reg);
+			time_wait_us(100);
+			if (tb_compare(mftb(), end) != TB_ABEFOREB) {
+				PHBERR(p, "CAPP: Capp FIR reset Timed-out.\n");
+				break;
+			}
+		} while (reg);
+
 		PHBDBG(p, "Doing CAPP recovery scoms\n");
+		/* clear capp fir */
+		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
+
 		/* disable snoops */
 		xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0);
 		load_capp_ucode(p);
@@ -3063,9 +3080,6 @@  static int do_capp_recovery_scoms(struct phb4 *p)
 		/* clear err rpt reg*/
 		xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0);
 
-		/* clear capp fir */
-		xscom_write(p->chip_id, CAPP_FIR + offset, 0);
-
 		/* Just reset Bit-0,1 and dont touch any other bit */
 		xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, &reg);
 		reg &= ~(PPC_BIT(0) | PPC_BIT(1));