Message ID | 20180830125751.25579-1-vaibhav@linux.ibm.com |
---|---|
State | Superseded |
Headers | show |
Series | phb4: Reset pfir and nfir if new errors reported during ETU reset | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | master/apply_patch Successfully applied |
snowpatch_ozlabs/make_check | success | Test make_check on branch master |
On 08/30/2018 06:27 PM, Vaibhav Jain wrote: > During fast-reboot a PCI device can continue sending requests even > after ETU-Reset is asserted. This will cause new errors to be reported > in ETU fir-registers and will result in values of variables nfir_cache > and pfir_cache to be out of sync. > > Presently during step-2 of CRESET nfir_cache and pfir_cache values are > used to bring the PHB out of reset state. However if these variables > are out of date the nfir/pfir registers are never reset completely and > ETU still remains frozen. > > Hence this patch updates step-2 of phb4_creset to re-read the values of > nfir/pfir registers to check if any new errors were reported after > ETU-reset was asserted, report these new errors and reset the > nfir/pfir registers. This should bring the ETU out of reset > successfully. > > Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Tested-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> -Vasant
On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote: > During fast-reboot a PCI device can continue sending requests even > after ETU-Reset is asserted. This will cause new errors to be > reported > in ETU fir-registers and will result in values of variables > nfir_cache > and pfir_cache to be out of sync. > > Presently during step-2 of CRESET nfir_cache and pfir_cache values > are > used to bring the PHB out of reset state. However if these variables > are out of date the nfir/pfir registers are never reset completely > and > ETU still remains frozen. > > Hence this patch updates step-2 of phb4_creset to re-read the values > of > nfir/pfir registers to check if any new errors were reported after > ETU-reset was asserted, report these new errors and reset the > nfir/pfir registers. This should bring the ETU out of reset > successfully. Fun bug. We could avoid this by setting the link disable bit before we assert the ETU reset in stage 1. That way we shouldn't get any more PCIe traffic while the reset is in progress. We set LD later on when we call into phb4_freset() so it shouldn't hurt. Can you or Vasant give this a try with your repro case? diff --git a/hw/phb4.c b/hw/phb4.c index 7cb8b89e5c20..6578aab09e2e 100644 --- a/hw/phb4.c +++ b/hw/phb4.c @@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot *slot) PHBDBG(p, "CRESET: Starts\n"); phb4_prepare_link_change(slot, false); + + phb4_pcicfg_read16(&p->phb, 0, p->ecap + PCICAP_EXP_LCTL, + ®16); + reg16 |= PCICAP_EXP_LCTL_LINK_DIS; + phb4_pcicfg_write16(&p->phb, 0, p->ecap + PCICAP_EXP_LCTL, + reg16); /* Clear error inject register, preventing recursive errors */ xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0); > > Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> > --- > hw/phb4.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/hw/phb4.c b/hw/phb4.c > index d1245dce..9c4b54b5 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot > *slot) > xscom_write(p->chip_id, p->pe_stk_xscom + > 0x1, > ~p->nfir_cache); > > + /* Re-read errors in PFIR and NFIR and reset > any new > + * error reported. This may happen as after > fundamental > + * reset was asserted in previous step the > device may > + * still be sending TLPs causing fence to be > raised. > + */ > + xscom_read(p->chip_id, p->pci_stk_xscom + > + XPEC_PCI_STK_PCI_FIR, &p- > >pfir_cache); > + xscom_read(p->chip_id, p->pe_stk_xscom + > + XPEC_NEST_STK_PCI_NFIR, &p- > >nfir_cache); > + > + if (p->pfir_cache || p->nfir_cache) { > + PHBERR(p, "CRESET: PHB still fenced > !!\n"); > + PHBERR(p, "PCI FIR=0x%016llx\n", > + p->pfir_cache); > + PHBERR(p, "NEST FIR=0x%016llx\n", > + p->nfir_cache); > + > + /* Dump other error registers */ > + phb4_eeh_dump_regs(p); > + > + /* Reset the PHB errors */ > + xscom_write(p->chip_id, p- > >pci_stk_xscom + > + XPEC_PCI_STK_PCI_FIR, > 0); > + xscom_write(p->chip_id, p- > >pe_stk_xscom + > + XPEC_NEST_STK_PCI_NFIR, > 0); Do we also need to call phb4_err_clear() here? If a TLP is causing an error during the reset I'd expect some of the lower-level error registers to be set, but those might be cleared during the reset. > + } > + > /* Clear PHB from reset */ > xscom_write(p->chip_id, > p->pci_stk_xscom + > XPEC_PCI_STK_ETU_RESET, 0x0);
On 08/31/2018 11:30 AM, Oliver O'Halloran wrote: > On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote: >> During fast-reboot a PCI device can continue sending requests even >> after ETU-Reset is asserted. This will cause new errors to be >> reported >> in ETU fir-registers and will result in values of variables >> nfir_cache >> and pfir_cache to be out of sync. >> >> Presently during step-2 of CRESET nfir_cache and pfir_cache values >> are >> used to bring the PHB out of reset state. However if these variables >> are out of date the nfir/pfir registers are never reset completely >> and >> ETU still remains frozen. >> >> Hence this patch updates step-2 of phb4_creset to re-read the values >> of >> nfir/pfir registers to check if any new errors were reported after >> ETU-reset was asserted, report these new errors and reset the >> nfir/pfir registers. This should bring the ETU out of reset >> successfully. > > Fun bug. > > We could avoid this by setting the link disable bit before we assert > the ETU reset in stage 1. That way we shouldn't get any more PCIe > traffic while the reset is in progress. We set LD later on when we > call into phb4_freset() so it shouldn't hurt. > > Can you or Vasant give this a try with your repro case? Yeah. Below change didn't fix the issue. I still see PHB errors and disks are not getting detected. / # grep PHB /sys/firmware/opal/msglog |grep Error [ 286.016664627,7] PHB#0033:00:00.0 Error -6 resetting [ 286.016682002,7] PHB#0002:00:00.0 Error -6 resetting [ 286.016690554,7] PHB#0001:00:00.0 Error -6 resetting [ 286.016725126,7] PHB#0004:00:00.0 Error -6 resetting [ 286.016736261,7] PHB#0003:00:00.0 Error -6 resetting [ 286.016758274,7] PHB#0031:00:00.0 Error -6 resetting [ 286.016763042,7] PHB#0030:00:00.0 Error -6 resetting [ 286.017111640,7] PHB#0000:00:00.0 Error -6 resetting [ 286.021380651,3] PHB#0002[0:2]: phb4_get_link_info: Error -6 getting link state [ 286.021380646,3] PHB#0001[0:1]: phb4_get_link_info: Error -6 getting link state [ 286.021384703,3] PHB#0033[8:3]: phb4_get_link_info: Error -6 getting link state [ 286.021397387,3] PHB#0033:00:00.0 Error -6 querying link status [ 286.021381928,3] PHB#0004[0:4]: phb4_get_link_info: Error -6 getting link state [ 286.021381356,3] PHB#0003[0:3]: phb4_get_link_info: Error -6 getting link state [ 286.021412009,3] PHB#0004:00:00.0 Error -6 querying link status [ 286.021418175,3] PHB#0003:00:00.0 Error -6 querying link status [ 286.021383000,3] PHB#0030[8:0]: phb4_get_link_info: Error -6 getting link state [ 286.021383992,3] PHB#0002:00:00.0 Error -6 querying link status [ 286.021388334,3] PHB#0001:00:00.0 Error -6 querying link status [ 286.021440040,3] PHB#0030:00:00.0 Error -6 querying link status [ 286.021383563,3] PHB#0031[8:1]: phb4_get_link_info: Error -6 getting link state [ 286.021468511,3] PHB#0031:00:00.0 Error -6 querying link status [ 286.021384405,3] PHB#0000[0:0]: phb4_get_link_info: Error -6 getting link state [ 286.021523256,3] PHB#0000:00:00.0 Error -6 querying link status -Vasant > > diff --git a/hw/phb4.c b/hw/phb4.c > index 7cb8b89e5c20..6578aab09e2e 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot > *slot) > PHBDBG(p, "CRESET: Starts\n"); > > phb4_prepare_link_change(slot, false); > + > + phb4_pcicfg_read16(&p->phb, 0, p->ecap + > PCICAP_EXP_LCTL, > + ®16); > + reg16 |= PCICAP_EXP_LCTL_LINK_DIS; > + phb4_pcicfg_write16(&p->phb, 0, p->ecap + > PCICAP_EXP_LCTL, > + reg16); > /* Clear error inject register, preventing recursive > errors */ > xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0); > > >> >> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> >> --- >> hw/phb4.c | 27 +++++++++++++++++++++++++++ >> 1 file changed, 27 insertions(+) >> >> diff --git a/hw/phb4.c b/hw/phb4.c >> index d1245dce..9c4b54b5 100644 >> --- a/hw/phb4.c >> +++ b/hw/phb4.c >> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot >> *slot) >> xscom_write(p->chip_id, p->pe_stk_xscom + >> 0x1, >> ~p->nfir_cache); >> >> + /* Re-read errors in PFIR and NFIR and reset >> any new >> + * error reported. This may happen as after >> fundamental >> + * reset was asserted in previous step the >> device may >> + * still be sending TLPs causing fence to be >> raised. >> + */ >> + xscom_read(p->chip_id, p->pci_stk_xscom + >> + XPEC_PCI_STK_PCI_FIR, &p- >>> pfir_cache); >> + xscom_read(p->chip_id, p->pe_stk_xscom + >> + XPEC_NEST_STK_PCI_NFIR, &p- >>> nfir_cache); >> + >> + if (p->pfir_cache || p->nfir_cache) { >> + PHBERR(p, "CRESET: PHB still fenced >> !!\n"); >> + PHBERR(p, "PCI FIR=0x%016llx\n", >> + p->pfir_cache); >> + PHBERR(p, "NEST FIR=0x%016llx\n", >> + p->nfir_cache); >> + >> + /* Dump other error registers */ >> + phb4_eeh_dump_regs(p); >> + >> + /* Reset the PHB errors */ >> + xscom_write(p->chip_id, p- >>> pci_stk_xscom + >> + XPEC_PCI_STK_PCI_FIR, >> 0); >> + xscom_write(p->chip_id, p- >>> pe_stk_xscom + >> + XPEC_NEST_STK_PCI_NFIR, >> 0); > > Do we also need to call phb4_err_clear() here? If a TLP is causing an > error during the reset I'd expect some of the lower-level error > registers to be set, but those might be cleared during the reset. > >> + } >> + >> /* Clear PHB from reset */ >> xscom_write(p->chip_id, >> p->pci_stk_xscom + >> XPEC_PCI_STK_ETU_RESET, 0x0); >
We should probably just merge this and if it turns out there's other problems we can fix them later. On Thu, Aug 30, 2018 at 10:57 PM, Vaibhav Jain <vaibhav@linux.ibm.com> wrote: > During fast-reboot a PCI device can continue sending requests even > after ETU-Reset is asserted. This will cause new errors to be reported > in ETU fir-registers and will result in values of variables nfir_cache > and pfir_cache to be out of sync. > > Presently during step-2 of CRESET nfir_cache and pfir_cache values are > used to bring the PHB out of reset state. However if these variables > are out of date the nfir/pfir registers are never reset completely and > ETU still remains frozen. > > Hence this patch updates step-2 of phb4_creset to re-read the values of > nfir/pfir registers to check if any new errors were reported after > ETU-reset was asserted, report these new errors and reset the > nfir/pfir registers. This should bring the ETU out of reset > successfully. > > Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> > --- > hw/phb4.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/hw/phb4.c b/hw/phb4.c > index d1245dce..9c4b54b5 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot *slot) > xscom_write(p->chip_id, p->pe_stk_xscom + 0x1, > ~p->nfir_cache); > > + /* Re-read errors in PFIR and NFIR and reset any new > + * error reported. This may happen as after fundamental > + * reset was asserted in previous step the device may > + * still be sending TLPs causing fence to be raised. > + */ > + xscom_read(p->chip_id, p->pci_stk_xscom + > + XPEC_PCI_STK_PCI_FIR, &p->pfir_cache); > + xscom_read(p->chip_id, p->pe_stk_xscom + > + XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache); > + > + if (p->pfir_cache || p->nfir_cache) { > + PHBERR(p, "CRESET: PHB still fenced !!\n"); > + PHBERR(p, "PCI FIR=0x%016llx\n", > + p->pfir_cache); > + PHBERR(p, "NEST FIR=0x%016llx\n", > + p->nfir_cache); The AIB error log register also useful here, can you also dump that? Something like this does the trick: diff --git a/hw/phb4.c b/hw/phb4.c index f463d20d15be..0f6d8bc5eb0b 100644 --- a/hw/phb4.c +++ b/hw/phb4.c @@ -3200,9 +3200,16 @@ static int64_t phb4_creset(struct pci_slot *slot) XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache); if (p->pfir_cache || p->nfir_cache) { + uint64_t err_aib; + + xscom_read(p->chip_id, p->pci_stk_xscom + + XPEC_PCI_STK_PBAIB_ERR_REPORT, + &err_aib); + PHBERR(p, "CRESET: PHB still fenced !!\n"); PHBERR(p, "PCI FIR=0x%016llx\n", p->pfir_cache); PHBERR(p, "NEST FIR=0x%016llx\n", + PHBERR(p, "AIB ERR=0x%016llx\n", err_aib); p->nfir_cache); > + /* Dump other error registers */ > + phb4_eeh_dump_regs(p); At this point the ETU is still in reset and trying to dump the error registers floods the msglog with XSCOM errors. I think this is because the HV indirect register access xscoms don't work while reset is asserted so this should probably be removed. Otherwise, Reviewed-by: Oliver O'Halloran <oohall@gmail.com> > + > + /* Reset the PHB errors */ > + xscom_write(p->chip_id, p->pci_stk_xscom + > + XPEC_PCI_STK_PCI_FIR, 0); > + xscom_write(p->chip_id, p->pe_stk_xscom + > + XPEC_NEST_STK_PCI_NFIR, 0); > + } > + > /* Clear PHB from reset */ > xscom_write(p->chip_id, > p->pci_stk_xscom + XPEC_PCI_STK_ETU_RESET, 0x0); > -- > 2.17.1 > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot
diff --git a/hw/phb4.c b/hw/phb4.c index d1245dce..9c4b54b5 100644 --- a/hw/phb4.c +++ b/hw/phb4.c @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot *slot) xscom_write(p->chip_id, p->pe_stk_xscom + 0x1, ~p->nfir_cache); + /* Re-read errors in PFIR and NFIR and reset any new + * error reported. This may happen as after fundamental + * reset was asserted in previous step the device may + * still be sending TLPs causing fence to be raised. + */ + xscom_read(p->chip_id, p->pci_stk_xscom + + XPEC_PCI_STK_PCI_FIR, &p->pfir_cache); + xscom_read(p->chip_id, p->pe_stk_xscom + + XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache); + + if (p->pfir_cache || p->nfir_cache) { + PHBERR(p, "CRESET: PHB still fenced !!\n"); + PHBERR(p, "PCI FIR=0x%016llx\n", + p->pfir_cache); + PHBERR(p, "NEST FIR=0x%016llx\n", + p->nfir_cache); + + /* Dump other error registers */ + phb4_eeh_dump_regs(p); + + /* Reset the PHB errors */ + xscom_write(p->chip_id, p->pci_stk_xscom + + XPEC_PCI_STK_PCI_FIR, 0); + xscom_write(p->chip_id, p->pe_stk_xscom + + XPEC_NEST_STK_PCI_NFIR, 0); + } + /* Clear PHB from reset */ xscom_write(p->chip_id, p->pci_stk_xscom + XPEC_PCI_STK_ETU_RESET, 0x0);
During fast-reboot a PCI device can continue sending requests even after ETU-Reset is asserted. This will cause new errors to be reported in ETU fir-registers and will result in values of variables nfir_cache and pfir_cache to be out of sync. Presently during step-2 of CRESET nfir_cache and pfir_cache values are used to bring the PHB out of reset state. However if these variables are out of date the nfir/pfir registers are never reset completely and ETU still remains frozen. Hence this patch updates step-2 of phb4_creset to re-read the values of nfir/pfir registers to check if any new errors were reported after ETU-reset was asserted, report these new errors and reset the nfir/pfir registers. This should bring the ETU out of reset successfully. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> --- hw/phb4.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+)