diff mbox series

[v2] powerpc/mce: log the error for all unrecoverable errors

Message ID 20230127182943.73073-1-ganeshgr@linux.ibm.com (mailing list archive)
State Superseded
Headers show
Series [v2] powerpc/mce: log the error for all unrecoverable errors | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 24 jobs.

Commit Message

Ganesh Goudar Jan. 27, 2023, 6:29 p.m. UTC
For all unrecoverable errors we are missing to log the
error, Since machine_check_log_err() is not getting called
for unrecoverable errors.

Raise irq work in save_mce_event() for unrecoverable errors,
So that we log the error from MCE event handling block in
timer handler.

Log without this change

 MCE: CPU27: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
 MCE: CPU27: PID: 10580 Comm: inject-ra-err NIP: [0000000010000df4]
 MCE: CPU27: Initiator CPU
 MCE: CPU27: Unknown

Log with this change

 MCE: CPU24: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
 MCE: CPU24: PID: 1589811 Comm: inject-ra-err NIP: [0000000010000e48]
 MCE: CPU24: Initiator CPU
 MCE: CPU24: Unknown
 RTAS: event: 5, Type: Platform Error (224), Severity: 3

Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
---
V2: Rephrasing the commit message.
---
 arch/powerpc/kernel/mce.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Jan. 31, 2023, 11:29 a.m. UTC | #1
Ganesh Goudar <ganeshgr@linux.ibm.com> writes:
> For all unrecoverable errors we are missing to log the
> error, Since machine_check_log_err() is not getting called
> for unrecoverable errors.
>
> Raise irq work in save_mce_event() for unrecoverable errors,
> So that we log the error from MCE event handling block in
> timer handler.

But the patch also removes the irq work raise from machine_check_ue_event().

That's currently done unconditionally, regardless of the disposition. So
doesn't this change also drop logging of recoverable UEs?

Maybe that's OK, but the change log should explain it.

> Log without this change
>
>  MCE: CPU27: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
>  MCE: CPU27: PID: 10580 Comm: inject-ra-err NIP: [0000000010000df4]
>  MCE: CPU27: Initiator CPU
>  MCE: CPU27: Unknown
>
> Log with this change
>
>  MCE: CPU24: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
>  MCE: CPU24: PID: 1589811 Comm: inject-ra-err NIP: [0000000010000e48]
>  MCE: CPU24: Initiator CPU
>  MCE: CPU24: Unknown
>  RTAS: event: 5, Type: Platform Error (224), Severity: 3
>
> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
> ---
> V2: Rephrasing the commit message.
> ---
>  arch/powerpc/kernel/mce.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index 6c5d30fba766..a1cb2172eb7b 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -131,6 +131,13 @@ void save_mce_event(struct pt_regs *regs, long handled,
>  	if (mce->error_type == MCE_ERROR_TYPE_UE)
>  		mce->u.ue_error.ignore_event = mce_err->ignore_event;
>  
> +	/*
> +	 * Raise irq work, So that we don't miss to log the error for
> +	 * unrecoverable errors.
> +	 */
> +	if (mce->disposition == MCE_DISPOSITION_NOT_RECOVERED)
> +		mce_irq_work_queue();
> +
>  	if (!addr)
>  		return;
>  
> @@ -235,7 +242,6 @@ static void machine_check_ue_event(struct machine_check_event *evt)
>  	       evt, sizeof(*evt));
>  
>  	/* Queue work to process this event later. */

This comment is meaningless without the function call it's commenting
about, ie. the comment should be removed too.

> -	mce_irq_work_queue();
>  }
>  

cheers
Ganesh Goudar Feb. 1, 2023, 9:28 a.m. UTC | #2
On 1/31/23 4:59 PM, Michael Ellerman wrote:

> Ganesh Goudar<ganeshgr@linux.ibm.com>  writes:
>> For all unrecoverable errors we are missing to log the
>> error, Since machine_check_log_err() is not getting called
>> for unrecoverable errors.
>>
>> Raise irq work in save_mce_event() for unrecoverable errors,
>> So that we log the error from MCE event handling block in
>> timer handler.
> But the patch also removes the irq work raise from machine_check_ue_event().
>
> That's currently done unconditionally, regardless of the disposition. So
> doesn't this change also drop logging of recoverable UEs?
>
> Maybe that's OK, but the change log should explain it.

Yes, its ok, exception vector code will do that for recoverable errors, ill explain
this in commit message.

>
>> Log without this change
>>
>>   MCE: CPU27: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
>>   MCE: CPU27: PID: 10580 Comm: inject-ra-err NIP: [0000000010000df4]
>>   MCE: CPU27: Initiator CPU
>>   MCE: CPU27: Unknown
>>
>> Log with this change
>>
>>   MCE: CPU24: machine check (Severe)  Real address Load/Store (foreign/control memory) [Not recovered]
>>   MCE: CPU24: PID: 1589811 Comm: inject-ra-err NIP: [0000000010000e48]
>>   MCE: CPU24: Initiator CPU
>>   MCE: CPU24: Unknown
>>   RTAS: event: 5, Type: Platform Error (224), Severity: 3
>>
>> Signed-off-by: Ganesh Goudar<ganeshgr@linux.ibm.com>
>> Reviewed-by: Mahesh Salgaonkar<mahesh@linux.ibm.com>
>> ---
>> V2: Rephrasing the commit message.
>> ---
>>   arch/powerpc/kernel/mce.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index 6c5d30fba766..a1cb2172eb7b 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -131,6 +131,13 @@ void save_mce_event(struct pt_regs *regs, long handled,
>>   	if (mce->error_type == MCE_ERROR_TYPE_UE)
>>   		mce->u.ue_error.ignore_event = mce_err->ignore_event;
>>   
>> +	/*
>> +	 * Raise irq work, So that we don't miss to log the error for
>> +	 * unrecoverable errors.
>> +	 */
>> +	if (mce->disposition == MCE_DISPOSITION_NOT_RECOVERED)
>> +		mce_irq_work_queue();
>> +
>>   	if (!addr)
>>   		return;
>>   
>> @@ -235,7 +242,6 @@ static void machine_check_ue_event(struct machine_check_event *evt)
>>   	       evt, sizeof(*evt));
>>   
>>   	/* Queue work to process this event later. */
> This comment is meaningless without the function call it's commenting
> about, ie. the comment should be removed too.

ok.

Thanks.
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 6c5d30fba766..a1cb2172eb7b 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -131,6 +131,13 @@  void save_mce_event(struct pt_regs *regs, long handled,
 	if (mce->error_type == MCE_ERROR_TYPE_UE)
 		mce->u.ue_error.ignore_event = mce_err->ignore_event;
 
+	/*
+	 * Raise irq work, So that we don't miss to log the error for
+	 * unrecoverable errors.
+	 */
+	if (mce->disposition == MCE_DISPOSITION_NOT_RECOVERED)
+		mce_irq_work_queue();
+
 	if (!addr)
 		return;
 
@@ -235,7 +242,6 @@  static void machine_check_ue_event(struct machine_check_event *evt)
 	       evt, sizeof(*evt));
 
 	/* Queue work to process this event later. */
-	mce_irq_work_queue();
 }
 
 /*