Message ID | 20191205131434.25091-1-oohall@gmail.com |
---|---|
State | Accepted |
Headers | show |
Series | xscom: Don't log xscom errors caused bu OPAL calls | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (d75e82dbfbb9443efeb3f9a5921ac23605aab469) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | fail | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
Hello Oliver, On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote: > The XSCOM read/write OPAL calls are largely there to support running > PRD in the OS. PRD itself handles submitting error logs (if needed) > when a XSCOM operation fails so there's no need to send an error log > from inside of OPAL. > > Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com> This is much better, simpler version than what I posted. Only one question. > --- > core/opal.c | 3 +++ > hw/xscom.c | 15 +++++++++++---- > 2 files changed, 14 insertions(+), 4 deletions(-) > > diff --git a/core/opal.c b/core/opal.c > index da746e805a3f..922dc5abddca 100644 > --- a/core/opal.c > +++ b/core/opal.c > @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe) > "Spent %llu msecs in OPAL call %llu!\n", > call_time, token); > } > + > + this_cpu->current_token = 0; > + > return retval; > } > > diff --git a/hw/xscom.c b/hw/xscom.c > index f3f4e1039ecf..381cf2c4aaa4 100644 > --- a/hw/xscom.c > +++ b/hw/xscom.c > @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add > break; > } > > - /* XXX: Create error log entry ? */ > - log_simple_error(&e_info(OPAL_RC_XSCOM_RW), > - "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", > - is_write ? "write" : "read", gcid, pcb_addr, stat); > + /* > + * If we're in an XSCOM opal call then squash the error > + * we assume that the caller (probably opal-prd) will > + * handle logging it > + */ > + if (this_cpu()->current_token != OPAL_XSCOM_READ && > + this_cpu()->current_token != OPAL_XSCOM_WRITE) { Shouldn't we restrict to not creating PELs only when the error is OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the XSCOM read/writes initiated by HBRT were failing ? > + log_simple_error(&e_info(OPAL_RC_XSCOM_RW), > + "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", > + is_write ? "write" : "read", gcid, pcb_addr, stat); > + } > > /* We need to reset the XSCOM or we'll hang on the next access */ > xscom_reset(gcid, false); > -- > 2.21.0 > -- Thanks and Regards gautham.
On Fri, Dec 6, 2019 at 2:32 AM Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote: > > Hello Oliver, > > > On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote: > > The XSCOM read/write OPAL calls are largely there to support running > > PRD in the OS. PRD itself handles submitting error logs (if needed) > > when a XSCOM operation fails so there's no need to send an error log > > from inside of OPAL. > > > > Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com> > > This is much better, simpler version than what I posted. > > Only one question. > > --- > > core/opal.c | 3 +++ > > hw/xscom.c | 15 +++++++++++---- > > 2 files changed, 14 insertions(+), 4 deletions(-) > > > > diff --git a/core/opal.c b/core/opal.c > > index da746e805a3f..922dc5abddca 100644 > > --- a/core/opal.c > > +++ b/core/opal.c > > @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe) > > "Spent %llu msecs in OPAL call %llu!\n", > > call_time, token); > > } > > + > > + this_cpu->current_token = 0; > > + > > return retval; > > } > > > > diff --git a/hw/xscom.c b/hw/xscom.c > > index f3f4e1039ecf..381cf2c4aaa4 100644 > > --- a/hw/xscom.c > > +++ b/hw/xscom.c > > @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add > > break; > > } > > > > - /* XXX: Create error log entry ? */ > > - log_simple_error(&e_info(OPAL_RC_XSCOM_RW), > > - "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", > > - is_write ? "write" : "read", gcid, pcb_addr, stat); > > + /* > > + * If we're in an XSCOM opal call then squash the error > > + * we assume that the caller (probably opal-prd) will > > + * handle logging it > > + */ > > + if (this_cpu()->current_token != OPAL_XSCOM_READ && > > + this_cpu()->current_token != OPAL_XSCOM_WRITE) { > > > Shouldn't we restrict to not creating PELs only when the error is > OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the > XSCOM read/writes initiated by HBRT were failing ? I'm pretty sure HBRT will creates PEL for any (unexpected) xscom failure so we'd just be doubling up. The other error results we return are: #define OPAL_XSCOM_BUSY OPAL_BUSY #define OPAL_XSCOM_CHIPLET_OFF OPAL_WRONG_STATE #define OPAL_XSCOM_PARTIAL_GOOD -25 #define OPAL_XSCOM_ADDR_ERROR -26 #define OPAL_XSCOM_CLOCK_ERROR -27 #define OPAL_XSCOM_PARITY_ERROR -28 #define OPAL_XSCOM_TIMEOUT -29 #define OPAL_XSCOM_CTR_OFFLINED -30 IMO these are all variations on the same theme. The caller did a scom to a register that isn't accessible for $reason so we might as well handle it consistently. If the error needs to be logged then the caller can deal with it. > > > + log_simple_error(&e_info(OPAL_RC_XSCOM_RW), > > + "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", > > + is_write ? "write" : "read", gcid, pcb_addr, stat); > > + } > > > > /* We need to reset the XSCOM or we'll hang on the next access */ > > xscom_reset(gcid, false); > > -- > > 2.21.0 > > > > -- > Thanks and Regards > gautham.
* Oliver O'Halloran <oohall@gmail.com> [2019-12-06 13:51:46]: > On Fri, Dec 6, 2019 at 2:32 AM Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote: > > > > Hello Oliver, > > > > > > On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote: > > > The XSCOM read/write OPAL calls are largely there to support running > > > PRD in the OS. PRD itself handles submitting error logs (if needed) > > > when a XSCOM operation fails so there's no need to send an error log > > > from inside of OPAL. > > > > > > Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> > > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> > > > > This is much better, simpler version than what I posted. > > > > Only one question. > > > --- > > > core/opal.c | 3 +++ > > > hw/xscom.c | 15 +++++++++++---- > > > 2 files changed, 14 insertions(+), 4 deletions(-) > > > > > > diff --git a/core/opal.c b/core/opal.c > > > index da746e805a3f..922dc5abddca 100644 > > > --- a/core/opal.c > > > +++ b/core/opal.c > > > @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe) > > > "Spent %llu msecs in OPAL call %llu!\n", > > > call_time, token); > > > } > > > + > > > + this_cpu->current_token = 0; > > > + > > > return retval; > > > } > > > > > > diff --git a/hw/xscom.c b/hw/xscom.c > > > index f3f4e1039ecf..381cf2c4aaa4 100644 > > > --- a/hw/xscom.c > > > +++ b/hw/xscom.c > > > @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add > > > break; > > > } > > > > > > - /* XXX: Create error log entry ? */ > > > - log_simple_error(&e_info(OPAL_RC_XSCOM_RW), > > > - "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", > > > - is_write ? "write" : "read", gcid, pcb_addr, stat); > > > + /* > > > + * If we're in an XSCOM opal call then squash the error > > > + * we assume that the caller (probably opal-prd) will > > > + * handle logging it > > > + */ > > > + if (this_cpu()->current_token != OPAL_XSCOM_READ && > > > + this_cpu()->current_token != OPAL_XSCOM_WRITE) { > > > > > > Shouldn't we restrict to not creating PELs only when the error is > > OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the > > XSCOM read/writes initiated by HBRT were failing ? Thank Oliver for the nice solution to check for OPAL call context. This takes care of HBRT and userspace xscom commandline for which we need not log an error. Anything else calling xscom during boot will still log an error which is very rare anyway. > I'm pretty sure HBRT will creates PEL for any (unexpected) xscom > failure so we'd just be doubling up. The other error results we return > are: > > #define OPAL_XSCOM_BUSY OPAL_BUSY > #define OPAL_XSCOM_CHIPLET_OFF OPAL_WRONG_STATE > #define OPAL_XSCOM_PARTIAL_GOOD -25 > #define OPAL_XSCOM_ADDR_ERROR -26 > #define OPAL_XSCOM_CLOCK_ERROR -27 > #define OPAL_XSCOM_PARITY_ERROR -28 > #define OPAL_XSCOM_TIMEOUT -29 > #define OPAL_XSCOM_CTR_OFFLINED -30 > > IMO these are all variations on the same theme. The caller did a scom > to a register that isn't accessible for $reason so we might as well > handle it consistently. If the error needs to be logged then the > caller can deal with it. This is fine now since the xscom code is stable and we have coded error recovery procedures. We originally needed error logging where some xscom fails mysteriously affects subsequent xscoms and make them fail as well. Now we will get prlog prints if there where retries. We can revisit this logging area if we run into new problems. --Vaidy
diff --git a/core/opal.c b/core/opal.c index da746e805a3f..922dc5abddca 100644 --- a/core/opal.c +++ b/core/opal.c @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe) "Spent %llu msecs in OPAL call %llu!\n", call_time, token); } + + this_cpu->current_token = 0; + return retval; } diff --git a/hw/xscom.c b/hw/xscom.c index f3f4e1039ecf..381cf2c4aaa4 100644 --- a/hw/xscom.c +++ b/hw/xscom.c @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add break; } - /* XXX: Create error log entry ? */ - log_simple_error(&e_info(OPAL_RC_XSCOM_RW), - "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", - is_write ? "write" : "read", gcid, pcb_addr, stat); + /* + * If we're in an XSCOM opal call then squash the error + * we assume that the caller (probably opal-prd) will + * handle logging it + */ + if (this_cpu()->current_token != OPAL_XSCOM_READ && + this_cpu()->current_token != OPAL_XSCOM_WRITE) { + log_simple_error(&e_info(OPAL_RC_XSCOM_RW), + "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n", + is_write ? "write" : "read", gcid, pcb_addr, stat); + } /* We need to reset the XSCOM or we'll hang on the next access */ xscom_reset(gcid, false);
The XSCOM read/write OPAL calls are largely there to support running PRD in the OS. PRD itself handles submitting error logs (if needed) when a XSCOM operation fails so there's no need to send an error log from inside of OPAL. Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> --- core/opal.c | 3 +++ hw/xscom.c | 15 +++++++++++---- 2 files changed, 14 insertions(+), 4 deletions(-)