Message ID | 20171130125051.1433-2-bsingharora@gmail.com |
---|---|
State | New |
Headers | show |
Series | [1/2] hw/npu2: Implement logging HMI actions | expand |
On 11/30/2017 06:20 PM, Balbir Singh wrote: > We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI > HMI handling paths found and accepted the reason for the local checkstop. Yes, this is because there is a possibility of cascading checkstops where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once. Hence we need to make sure that we detect reasons from all FIRs including core FIR. > We also trash hmi_evt in the call. Every find_*_checkstop*() caller fills and queues up the hmi_evt to be sent to kernel. Once it is queued, the hmi_evt is free to use by next caller. Thanks, -Mahesh. > > Signed-off-by: Balbir Singh <bsingharora@gmail.com> > --- > core/hmi.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/core/hmi.c b/core/hmi.c > index bb144512..62a916ba 100644 > --- a/core/hmi.c > +++ b/core/hmi.c > @@ -697,7 +697,8 @@ static void decode_malfunction(struct OpalHMIEvent *hmi_evt) > } > } > > - find_core_checkstop_reason(hmi_evt, &event_generated); > + if (!event_generated) > + find_core_checkstop_reason(hmi_evt, &event_generated); > > /* > * If we fail to find checkstop reason, send an unknown HMI event. >
On Fri, Dec 1, 2017 at 7:05 PM, Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> wrote: > On 11/30/2017 06:20 PM, Balbir Singh wrote: >> We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI >> HMI handling paths found and accepted the reason for the local checkstop. > > Yes, this is because there is a possibility of cascading checkstops > where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once. > Hence we need to make sure that we detect reasons from all FIRs > including core FIR. > Cascading as in one event for all the triggered FIR's - each causing a malfunction error? Have you ever seen this case, several cascaded FIR bits causing just one HMI malfunction error? >> We also trash hmi_evt in the call. > > Every find_*_checkstop*() caller fills and queues up the hmi_evt to be > sent to kernel. Once it is queued, the hmi_evt is free to use by next > caller. > I agree on the queuing bits, The code looks a bit odd to me find_core_checkstop_reason() unconditionally initializes hmi_evt->severity = OpalHMI_SEV_FATAL; hmi_evt->type = OpalHMI_ERROR_MALFUNC_ALERT; hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_CORE; Then in decode_malfunction we do /* * If we fail to find checkstop reason, send an unknown HMI event. */ if (!event_generated) { hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN; hmi_evt->u.xstop_error.xstop_reason = 0; queue_hmi_event(hmi_evt, false); } Do we want to leave the severity as FATAL from there? Moreover the code organization needs revisiting Balbir Singh
On 12/04/2017 11:31 AM, Balbir Singh wrote: > On Fri, Dec 1, 2017 at 7:05 PM, Mahesh Jagannath Salgaonkar > <mahesh@linux.vnet.ibm.com> wrote: >> On 11/30/2017 06:20 PM, Balbir Singh wrote: >>> We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI >>> HMI handling paths found and accepted the reason for the local checkstop. >> >> Yes, this is because there is a possibility of cascading checkstops >> where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once. >> Hence we need to make sure that we detect reasons from all FIRs >> including core FIR. >> > > Cascading as in one event for all the triggered FIR's - each causing a > malfunction > error? Have you ever seen this case, several cascaded FIR bits causing just > one HMI malfunction error? Nope. I have never come across such a case but there is a possibility. I remember HW folks mentioning this when we initially implemented HMI handling. > >>> We also trash hmi_evt in the call. >> >> Every find_*_checkstop*() caller fills and queues up the hmi_evt to be >> sent to kernel. Once it is queued, the hmi_evt is free to use by next >> caller. >> > > I agree on the queuing bits, > > The code looks a bit odd to me > > find_core_checkstop_reason() unconditionally initializes > > hmi_evt->severity = OpalHMI_SEV_FATAL; > hmi_evt->type = OpalHMI_ERROR_MALFUNC_ALERT; > hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_CORE; > > Then in decode_malfunction we do > > /* > * If we fail to find checkstop reason, send an unknown HMI event. > */ > if (!event_generated) { > hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN; > hmi_evt->u.xstop_error.xstop_reason = 0; > queue_hmi_event(hmi_evt, false); > } > > Do we want to leave the severity as FATAL from there? Moreover the > code organization > needs revisiting That was the original intent for unknown malfunction alert since we considered all malfunction alerts as fatal unrecovered HW errors. Hence the code uses the hmi_evt that was initialized in find_core_checkstop_reason() and just change the xstop_type to Unknown. I agree code need to be more readable. We can re-initialize the hmi_evt again in (!event_generated) case to remove the confusion. Unknown malf alert means that opal handler has failed to detect the HW error reason and we don't know whether it is good idea for kernel to continue. Hence we leave the severity as FATAL and disposition to unrecovered. This also means hmi handler need to be enhanced to add the missing functionality. Do you think we should mark it as a warning and send it as recovered event ? Thanks, -Mahesh.
diff --git a/core/hmi.c b/core/hmi.c index bb144512..62a916ba 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -697,7 +697,8 @@ static void decode_malfunction(struct OpalHMIEvent *hmi_evt) } } - find_core_checkstop_reason(hmi_evt, &event_generated); + if (!event_generated) + find_core_checkstop_reason(hmi_evt, &event_generated); /* * If we fail to find checkstop reason, send an unknown HMI event.
We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI HMI handling paths found and accepted the reason for the local checkstop. We also trash hmi_evt in the call. Signed-off-by: Balbir Singh <bsingharora@gmail.com> --- core/hmi.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)