Message ID | 20181016113240.13993-1-vaibhav@linux.ibm.com |
---|---|
State | RFC |
Headers | show |
Series | [RFC] phb4: Wait for PRD to reset the CAPP Fir during recovery | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | master/apply_patch Successfully applied |
snowpatch_ozlabs/make_check | success | Test make_check on branch master |
Le 16/10/2018 à 13:32, Vaibhav Jain a écrit : > During CAPP recovery do_capp_recovery_scoms() will reset the CAPP Fir > register just after CAPP recovery is completed. This has an > unintentional side effect of preventing PRD from analyzing and > reporting this error. If PRD tries to read the CAPP FIR after opal has > already reset it, then it logs a critical error complaining "No active > error bits found". > > To prevent this from happening we update do_capp_recovery_scoms() to > wait for CAPP Fir to be reset by PRD just after CAPP recovery > completes and before we proceed with rest of the CAPP recovery > sequence. A timeout of 5ms is used to wait for CAPP-Fir reset before > we reset the register on our own. This is to guard against the > possibility of Opal PRD daemon crashing/not-running. > > Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> > --- This looks really odd to me. I think we need to understand why the PRD is messing with the CAPP FIR. Fred > hw/phb4.c | 20 +++++++++++++++++--- > 1 file changed, 17 insertions(+), 3 deletions(-) > > diff --git a/hw/phb4.c b/hw/phb4.c > index 10df206b..bb95a953 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -3055,7 +3055,24 @@ static int do_capp_recovery_scoms(struct phb4 *p) > > /* Check if the recovery failed or passed */ > if (reg & PPC_BIT(1)) { > + > + PHBDBG(p, "Waiting for FIR to reset\n"); > + > + /* Wait for FIR to be reset by PRD */ > + end = mftb() + msecs_to_tb(5); > + do { > + xscom_read(p->chip_id, CAPP_FIR + offset, ®); > + time_wait_us(100); > + if (tb_compare(mftb(), end) != TB_ABEFOREB) { > + PHBERR(p, "CAPP: Capp FIR reset Timed-out.\n"); > + break; > + } > + } while (reg); > + > PHBDBG(p, "Doing CAPP recovery scoms\n"); > + /* clear capp fir */ > + xscom_write(p->chip_id, CAPP_FIR + offset, 0); > + > /* disable snoops */ > xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0); > load_capp_ucode(p); > @@ -3063,9 +3080,6 @@ static int do_capp_recovery_scoms(struct phb4 *p) > /* clear err rpt reg*/ > xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0); > > - /* clear capp fir */ > - xscom_write(p->chip_id, CAPP_FIR + offset, 0); > - > /* Just reset Bit-0,1 and dont touch any other bit */ > xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); > reg &= ~(PPC_BIT(0) | PPC_BIT(1)); >
On 17/10/18 2:43 am, Frederic Barrat wrote: > > > Le 16/10/2018 à 13:32, Vaibhav Jain a écrit : >> During CAPP recovery do_capp_recovery_scoms() will reset the CAPP Fir >> register just after CAPP recovery is completed. This has an >> unintentional side effect of preventing PRD from analyzing and >> reporting this error. If PRD tries to read the CAPP FIR after opal has >> already reset it, then it logs a critical error complaining "No active >> error bits found". >> >> To prevent this from happening we update do_capp_recovery_scoms() to >> wait for CAPP Fir to be reset by PRD just after CAPP recovery >> completes and before we proceed with rest of the CAPP recovery >> sequence. A timeout of 5ms is used to wait for CAPP-Fir reset before >> we reset the register on our own. This is to guard against the >> possibility of Opal PRD daemon crashing/not-running. >> >> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> >> --- > > This looks really odd to me. I think we need to understand why the PRD > is messing with the CAPP FIR. Spinning in skiboot for 5ms is also not great. How does PRD cope with FIRs in other parts of the system which get cleared as part of recovery? (I'm thinking PHBs etc)
diff --git a/hw/phb4.c b/hw/phb4.c index 10df206b..bb95a953 100644 --- a/hw/phb4.c +++ b/hw/phb4.c @@ -3055,7 +3055,24 @@ static int do_capp_recovery_scoms(struct phb4 *p) /* Check if the recovery failed or passed */ if (reg & PPC_BIT(1)) { + + PHBDBG(p, "Waiting for FIR to reset\n"); + + /* Wait for FIR to be reset by PRD */ + end = mftb() + msecs_to_tb(5); + do { + xscom_read(p->chip_id, CAPP_FIR + offset, ®); + time_wait_us(100); + if (tb_compare(mftb(), end) != TB_ABEFOREB) { + PHBERR(p, "CAPP: Capp FIR reset Timed-out.\n"); + break; + } + } while (reg); + PHBDBG(p, "Doing CAPP recovery scoms\n"); + /* clear capp fir */ + xscom_write(p->chip_id, CAPP_FIR + offset, 0); + /* disable snoops */ xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0); load_capp_ucode(p); @@ -3063,9 +3080,6 @@ static int do_capp_recovery_scoms(struct phb4 *p) /* clear err rpt reg*/ xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0); - /* clear capp fir */ - xscom_write(p->chip_id, CAPP_FIR + offset, 0); - /* Just reset Bit-0,1 and dont touch any other bit */ xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); reg &= ~(PPC_BIT(0) | PPC_BIT(1));
During CAPP recovery do_capp_recovery_scoms() will reset the CAPP Fir register just after CAPP recovery is completed. This has an unintentional side effect of preventing PRD from analyzing and reporting this error. If PRD tries to read the CAPP FIR after opal has already reset it, then it logs a critical error complaining "No active error bits found". To prevent this from happening we update do_capp_recovery_scoms() to wait for CAPP Fir to be reset by PRD just after CAPP recovery completes and before we proceed with rest of the CAPP recovery sequence. A timeout of 5ms is used to wait for CAPP-Fir reset before we reset the register on our own. This is to guard against the possibility of Opal PRD daemon crashing/not-running. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> --- hw/phb4.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-)