Message ID | 20180319041535.27251-1-vaibhav@linux.vnet.ibm.com |
---|---|
State | Superseded |
Headers | show |
Series | [v3] capi: Poll Err/Status register during CAPP recovery | expand |
On Mon, 2018-03-19 at 09:45 +0530, Vaibhav Jain wrote: > This patch updates do_capp_recovery_scoms() to poll the CAPP > Err/Status control register, check for CAPP-Recovery to complete/fail > based on indications of BITS-1,5,9 and then proceed with the > CAPP-Recovery scoms iif recovery completed successfully. This would > prevent cases where we bring-up the PCIe link while recovery > sequencer > on CAPP is still busy with casting out cache lines. > > In case CAPP-Recovery didn't complete successfully an error is > returned > from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4 > fenced and mark it as broken. > > The loop that implements polling of Err/Status register will also log > an error on the PHB when it continues for more than 168ms which is > the > max time to failure for CAPP-Recovery. > > Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> > --- > Changelog: > v3 -> Introduced a timeout for the Poll loop of 168ms [Christophe] > > v2 -> Added an extra check for Bit(0) in Err/Status reg at the > beginning to check if recovery mode was entered. [Christophe] > --- > hw/phb4.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++--- > ---------- > 1 file changed, 68 insertions(+), 17 deletions(-) > > diff --git a/hw/phb4.c b/hw/phb4.c > index 47175df2..30f46f9a 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -2857,25 +2857,75 @@ static int64_t load_capp_ucode(struct phb4 > *p) > return rc; > } > > -static void do_capp_recovery_scoms(struct phb4 *p) > +static int do_capp_recovery_scoms(struct phb4 *p) > { > - uint64_t reg; > - uint32_t offset; > + uint64_t rc, reg, end; > + uint64_t offset = PHB4_CAPP_REG_OFFSET(p); > > - PHBDBG(p, "Doing CAPP recovery scoms\n"); > - > - offset = PHB4_CAPP_REG_OFFSET(p); > - /* disable snoops */ > - xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0); > - load_capp_ucode(p); > - /* clear err rpt reg*/ > - xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0); > - /* clear capp fir */ > - xscom_write(p->chip_id, CAPP_FIR + offset, 0); > > + /* Get the status of CAPP recovery */ > xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); > - reg &= ~(PPC_BIT(0) | PPC_BIT(1)); > - xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg); > + > + /* No recovery in progress ignore */ > + if ((reg & PPC_BIT(0)) == 0) { > + PHBDBG(p, "CAPP: No recovery in progress\n"); > + return 0; Did you mean OPAL_SUCCESS here? Other return points in the function use the OPAL_* macros. > + } > + > + PHBDBG(p, "CAPP: Waiting for recovery to complete\n"); > + /* recovery timer failure period 168ms */ > + end = mftb() + msecs_to_tb(168); > + while ((reg & (PPC_BIT(1) | PPC_BIT(5) | PPC_BIT(9))) == 0) > { > + > + time_wait_ms(5); > + xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + > offset, ®); > + > + if (end && tb_compare(mftb(), end) != TB_AAFTERB) { > + PHBERR(p, "CAPP: Capp recovery Timed- > out.\n"); > + end = 0; > + break; > + } > + } > + > + /* Check if the recovery failed or passed */ > + if (reg & PPC_BIT(1)) { > + PHBDBG(p, "Doing CAPP recovery scoms\n"); > + /* disable snoops */ > + xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, > 0); > + load_capp_ucode(p); > + > + /* clear err rpt reg*/ > + xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, > 0); > + > + /* clear capp fir */ > + xscom_write(p->chip_id, CAPP_FIR + offset, 0); > + > + /* Just reset Bit-0,1 and dont touch any other bit > */ > + xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + > offset, ®); > + reg &= ~(PPC_BIT(0) | PPC_BIT(1)); > + xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + > offset, reg); > + > + PHBDBG(p, "CAPP recovery complete\n"); > + rc = OPAL_SUCCESS; > + > + } else { > + /* Most likely will checkstop here due to FIR ACTION > for > + * failed recovery. So this message would never be > logged. > + * But if we still enter here then return an error > forcing a > + * fence of the PHB. > + */ > + if (reg & PPC_BIT(5)) > + PHBERR(p, "CAPP: Capp recovery Failed\n"); > + else if (reg & PPC_BIT(9)) > + PHBERR(p, "CAPP: Capp recovery hang > detected\n"); > + else if (end != 0) > + PHBERR(p, "CAPP: Unknown recovery > failure\n"); > + > + PHBDBG(p, "CAPP: Err/Status-reg=0x%016llx\n", reg); > + rc = OPAL_HARDWARE; > + } > + > + return rc; > } > > static int64_t phb4_creset(struct pci_slot *slot) > @@ -2934,8 +2984,9 @@ static int64_t phb4_creset(struct pci_slot > *slot) > PHBDBG(p, "CRESET: No pending > transactions\n"); > > /* capp recovery */ > - if (p->flags & PHB4_CAPP_RECOVERY) > - do_capp_recovery_scoms(p); > + if (p->flags & PHB4_CAPP_RECOVERY && > + do_capp_recovery_scoms(p)) > + goto error; > > /* Clear errors in PFIR and NFIR */ > xscom_write(p->chip_id, p->pci_stk_xscom + > 0x1, >
This patch updates do_capp_recovery_scoms() to poll the CAPP > Err/Status control register, check for CAPP-Recovery to complete/fail > based on indications of BITS-1,5,9 and then proceed with the > CAPP-Recovery scoms iif recovery completed successfully. This would > prevent cases where we bring-up the PCIe link while recovery > sequencer > on CAPP is still busy with casting out cache lines. > > In case CAPP-Recovery didn't complete successfully an error is > returned > from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4 > fenced and mark it as broken. > > The loop that implements polling of Err/Status register will also log > an error on the PHB when it continues for more than 168ms which is > the > max time to failure for CAPP-Recovery. > > Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> > --- > Changelog: > v3 -> Introduced a timeout for the Poll loop of 168ms [Christophe] > > v2 -> Added an extra check for Bit(0) in Err/Status reg at the > beginning to check if recovery mode was entered. [Christophe] > --- > hw/phb4.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++--- > ---------- > 1 file changed, 68 insertions(+), 17 deletions(-) > > diff --git a/hw/phb4.c b/hw/phb4.c > index 47175df2..30f46f9a 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -2857,25 +2857,75 @@ static int64_t load_capp_ucode(struct phb4 > *p) > return rc; > } > > -static void do_capp_recovery_scoms(struct phb4 *p) > +static int do_capp_recovery_scoms(struct phb4 *p) > { > - uint64_t reg; > - uint32_t offset; > + uint64_t rc, reg, end; > + uint64_t offset = PHB4_CAPP_REG_OFFSET(p); > > - PHBDBG(p, "Doing CAPP recovery scoms\n"); > - > - offset = PHB4_CAPP_REG_OFFSET(p); > - /* disable snoops */ > - xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0); > - load_capp_ucode(p); > - /* clear err rpt reg*/ > - xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0); > - /* clear capp fir */ > - xscom_write(p->chip_id, CAPP_FIR + offset, 0); > > + /* Get the status of CAPP recovery */ > xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); > - reg &= ~(PPC_BIT(0) | PPC_BIT(1)); > - xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg); > + > + /* No recovery in progress ignore */ > + if ((reg & PPC_BIT(0)) == 0) { > + PHBDBG(p, "CAPP: No recovery in progress\n"); > + return 0; > + } > + > + PHBDBG(p, "CAPP: Waiting for recovery to complete\n"); > + /* recovery timer failure period 168ms */ > + end = mftb() + msecs_to_tb(168); > + while ((reg & (PPC_BIT(1) | PPC_BIT(5) | PPC_BIT(9))) == 0) > { > + > + time_wait_ms(5); > + xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + > offset, ®); > + > + if (end && tb_compare(mftb(), end) != TB_AAFTERB) { > + PHBERR(p, "CAPP: Capp recovery Timed- > out.\n"); > + end = 0; > + break; > + } > + } > + > + /* Check if the recovery failed or passed */ > + if (reg & PPC_BIT(1)) { > + PHBDBG(p, "Doing CAPP recovery scoms\n"); > + /* disable snoops */ > + xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, > 0); > + load_capp_ucode(p); > + > + /* clear err rpt reg*/ > + xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, > 0); > + > + /* clear capp fir */ > + xscom_write(p->chip_id, CAPP_FIR + offset, 0); > + > + /* Just reset Bit-0,1 and dont touch any other bit > */ > + xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + > offset, ®); > + reg &= ~(PPC_BIT(0) | PPC_BIT(1)); > + xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + > offset, reg); > + > + PHBDBG(p, "CAPP recovery complete\n"); > + rc = OPAL_SUCCESS; > + > + } else { > + /* Most likely will checkstop here due to FIR ACTION > for > + * failed recovery. So this message would never be > logged. > + * But if we still enter here then return an error > forcing a > + * fence of the PHB. > + */ > + if (reg & PPC_BIT(5)) > + PHBERR(p, "CAPP: Capp recovery Failed\n"); > + else if (reg & PPC_BIT(9)) > + PHBERR(p, "CAPP: Capp recovery hang > detected\n"); > + else if (end != 0) > + PHBERR(p, "CAPP: Unknown recovery > failure\n"); > + > + PHBDBG(p, "CAPP: Err/Status-reg=0x%016llx\n", reg); > + rc = OPAL_HARDWARE; > + } > + > + return rc; > } > > static int64_t phb4_creset(struct pci_slot *slot) > @@ -2934,8 +2984,9 @@ static int64_t phb4_creset(struct pci_slot > *slot) > PHBDBG(p, "CRESET: No pending > transactions\n"); > > /* capp recovery */ > - if (p->flags & PHB4_CAPP_RECOVERY) > - do_capp_recovery_scoms(p); > + if (p->flags & PHB4_CAPP_RECOVERY && > + do_capp_recovery_scoms(p)) Sorry, I missed a comment here on the last mail. Elsewhere in the skiboot we explicitly check for ==/!= OPAL_SUCCESS, rather than assuming the value evaluates to false. > + goto error; > > /* Clear errors in PFIR and NFIR */ > xscom_write(p->chip_id, p->pci_stk_xscom + > 0x1, >
Hey Alastair, Thanks for looking into this patch: Alastair D'Silva <alastair@au1.ibm.com> writes: >> + /* No recovery in progress ignore */ >> + if ((reg & PPC_BIT(0)) == 0) { >> + PHBDBG(p, "CAPP: No recovery in progress\n"); >> + return 0; > Did you mean OPAL_SUCCESS here? Other return points in the function use > the OPAL_* macros. Good catch. Will include this change in next revision.
diff --git a/hw/phb4.c b/hw/phb4.c index 47175df2..30f46f9a 100644 --- a/hw/phb4.c +++ b/hw/phb4.c @@ -2857,25 +2857,75 @@ static int64_t load_capp_ucode(struct phb4 *p) return rc; } -static void do_capp_recovery_scoms(struct phb4 *p) +static int do_capp_recovery_scoms(struct phb4 *p) { - uint64_t reg; - uint32_t offset; + uint64_t rc, reg, end; + uint64_t offset = PHB4_CAPP_REG_OFFSET(p); - PHBDBG(p, "Doing CAPP recovery scoms\n"); - - offset = PHB4_CAPP_REG_OFFSET(p); - /* disable snoops */ - xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0); - load_capp_ucode(p); - /* clear err rpt reg*/ - xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0); - /* clear capp fir */ - xscom_write(p->chip_id, CAPP_FIR + offset, 0); + /* Get the status of CAPP recovery */ xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); - reg &= ~(PPC_BIT(0) | PPC_BIT(1)); - xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg); + + /* No recovery in progress ignore */ + if ((reg & PPC_BIT(0)) == 0) { + PHBDBG(p, "CAPP: No recovery in progress\n"); + return 0; + } + + PHBDBG(p, "CAPP: Waiting for recovery to complete\n"); + /* recovery timer failure period 168ms */ + end = mftb() + msecs_to_tb(168); + while ((reg & (PPC_BIT(1) | PPC_BIT(5) | PPC_BIT(9))) == 0) { + + time_wait_ms(5); + xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); + + if (end && tb_compare(mftb(), end) != TB_AAFTERB) { + PHBERR(p, "CAPP: Capp recovery Timed-out.\n"); + end = 0; + break; + } + } + + /* Check if the recovery failed or passed */ + if (reg & PPC_BIT(1)) { + PHBDBG(p, "Doing CAPP recovery scoms\n"); + /* disable snoops */ + xscom_write(p->chip_id, SNOOP_CAPI_CONFIG + offset, 0); + load_capp_ucode(p); + + /* clear err rpt reg*/ + xscom_write(p->chip_id, CAPP_ERR_RPT_CLR + offset, 0); + + /* clear capp fir */ + xscom_write(p->chip_id, CAPP_FIR + offset, 0); + + /* Just reset Bit-0,1 and dont touch any other bit */ + xscom_read(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, ®); + reg &= ~(PPC_BIT(0) | PPC_BIT(1)); + xscom_write(p->chip_id, CAPP_ERR_STATUS_CTRL + offset, reg); + + PHBDBG(p, "CAPP recovery complete\n"); + rc = OPAL_SUCCESS; + + } else { + /* Most likely will checkstop here due to FIR ACTION for + * failed recovery. So this message would never be logged. + * But if we still enter here then return an error forcing a + * fence of the PHB. + */ + if (reg & PPC_BIT(5)) + PHBERR(p, "CAPP: Capp recovery Failed\n"); + else if (reg & PPC_BIT(9)) + PHBERR(p, "CAPP: Capp recovery hang detected\n"); + else if (end != 0) + PHBERR(p, "CAPP: Unknown recovery failure\n"); + + PHBDBG(p, "CAPP: Err/Status-reg=0x%016llx\n", reg); + rc = OPAL_HARDWARE; + } + + return rc; } static int64_t phb4_creset(struct pci_slot *slot) @@ -2934,8 +2984,9 @@ static int64_t phb4_creset(struct pci_slot *slot) PHBDBG(p, "CRESET: No pending transactions\n"); /* capp recovery */ - if (p->flags & PHB4_CAPP_RECOVERY) - do_capp_recovery_scoms(p); + if (p->flags & PHB4_CAPP_RECOVERY && + do_capp_recovery_scoms(p)) + goto error; /* Clear errors in PFIR and NFIR */ xscom_write(p->chip_id, p->pci_stk_xscom + 0x1,
This patch updates do_capp_recovery_scoms() to poll the CAPP Err/Status control register, check for CAPP-Recovery to complete/fail based on indications of BITS-1,5,9 and then proceed with the CAPP-Recovery scoms iif recovery completed successfully. This would prevent cases where we bring-up the PCIe link while recovery sequencer on CAPP is still busy with casting out cache lines. In case CAPP-Recovery didn't complete successfully an error is returned from do_capp_recovery_scoms() asking phb4_creset() to keep the phb4 fenced and mark it as broken. The loop that implements polling of Err/Status register will also log an error on the PHB when it continues for more than 168ms which is the max time to failure for CAPP-Recovery. Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> --- Changelog: v3 -> Introduced a timeout for the Poll loop of 168ms [Christophe] v2 -> Added an extra check for Bit(0) in Err/Status reg at the beginning to check if recovery mode was entered. [Christophe] --- hw/phb4.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 68 insertions(+), 17 deletions(-)