[RFC] powerpc/pseries: Ratelimit EPOW event warnings

Message ID 1432787595-9946-1-git-send-email-kamalesh@linux.vnet.ibm.com (mailing list archive)
State Changes Requested

Commit Message

Kamalesh Babulal May 28, 2015, 4:33 a.m. UTC
We print the respective warning after parsing EPOW interrupts,
prompting the user to take action depending on the severity of the
event.

Sometimes the same EPOW event warning, such as the one below, can flood
the kernel log within a very short duration. So limit the message by
using the ratelimited variant of pr_err.

May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
May 25 04:04:24 alp kernel: Non critical power or cooling issue cleared
May 25 04:07:18 alp kernel: Non critical power or cooling issue cleared
May 25 04:13:04 alp kernel: Non critical power or cooling issue cleared
May 25 04:22:04 alp kernel: Non critical power or cooling issue cleared
May 25 04:22:26 alp kernel: Non critical power or cooling issue cleared
May 25 04:22:36 alp kernel: Non critical power or cooling issue cleared

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/platforms/pseries/ras.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Michael Ellerman June 1, 2015, 11:26 a.m. UTC | #1
On Thu, 2015-05-28 at 10:03 +0530, Kamalesh Babulal wrote:
> We print the respective warning after parsing EPOW interrupts,
> prompting user to take action depending upon the severity of the
> event.
> 
> Some times same EPOW event warning, such as below could flood kernel
> log, within very short duration. So Limit the message by using
> ratelimit variant of pr_err.
> 
> May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
> May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
> May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
> May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
> May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
> May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
> May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
> May 25 04:04:24 alp kernel: Non critical power or cooling issue cleared
> May 25 04:07:18 alp kernel: Non critical power or cooling issue cleared
> May 25 04:13:04 alp kernel: Non critical power or cooling issue cleared
> May 25 04:22:04 alp kernel: Non critical power or cooling issue cleared
> May 25 04:22:26 alp kernel: Non critical power or cooling issue cleared
> May 25 04:22:36 alp kernel: Non critical power or cooling issue cleared

Looking at the time stamps those are actually all fairly far apart in time,
aren't they? So do we actually see them within a short duration in practice?

It does seem sensible to rate limit them though.

> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 02e4a17..2556bc2 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -145,17 +145,17 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log)
>  
>  	switch (action_code) {
>  	case EPOW_RESET:
> -		pr_err("Non critical power or cooling issue cleared");
> +		pr_err_ratelimited("Non critical power or cooling issue cleared");
>  		break;
>  
>  	case EPOW_WARN_COOLING:
> -		pr_err("Non critical cooling issue reported by firmware");
> -		pr_err("Check RTAS error log for details");
> +		pr_err_ratelimited("Non critical cooling issue reported by firmware");
> +		pr_err_ratelimited("Check RTAS error log for details");
>  		break;
>  
>  	case EPOW_WARN_POWER:
> -		pr_err("Non critical power issue reported by firmware");
> -		pr_err("Check RTAS error log for details");
> +		pr_err_ratelimited("Non critical power issue reported by firmware");
> +		pr_err_ratelimited("Check RTAS error log for details");
>  		break;

Those last two could be collapsed onto one line which would reduce the spam.
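
A minimal sketch of the collapsed form suggested here, assuming one complete message string per action code. epow_message() is a hypothetical helper, the enum is only an illustrative stand-in for the action codes in ras.c, and the combined wording is an assumption rather than the text of any resent patch. In the kernel, each string would simply be passed to a single pr_err_ratelimited() call.

```c
#include <stddef.h>

/* Illustrative stand-ins for the EPOW action codes; the values are assumed. */
enum epow_action {
	EPOW_RESET,
	EPOW_WARN_COOLING,
	EPOW_WARN_POWER,
};

/*
 * Hypothetical helper: return one newline-terminated message per event,
 * so that each event produces a single log line instead of two.
 * Returns NULL for action codes this sketch does not cover.
 */
const char *epow_message(int action_code)
{
	switch (action_code) {
	case EPOW_RESET:
		return "Non critical power or cooling issue cleared\n";
	case EPOW_WARN_COOLING:
		return "Non critical cooling issue reported by firmware. "
		       "Check RTAS error log for details\n";
	case EPOW_WARN_POWER:
		return "Non critical power issue reported by firmware. "
		       "Check RTAS error log for details\n";
	default:
		return NULL;
	}
}
```

With the hint folded into the same string, each warning costs one log line and one pass through the rate limiter rather than two.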

cheers
Kamalesh Babulal June 2, 2015, 5:03 a.m. UTC | #2
* Michael Ellerman <mpe@ellerman.id.au> [2015-06-01 21:26:51]:

> On Thu, 2015-05-28 at 10:03 +0530, Kamalesh Babulal wrote:
> > We print the respective warning after parsing EPOW interrupts,
> > prompting user to take action depending upon the severity of the
> > event.
> > 
> > Some times same EPOW event warning, such as below could flood kernel
> > log, within very short duration. So Limit the message by using
> > ratelimit variant of pr_err.
> > 
> > May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
> > May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
> > May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
> > May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
> > May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
> > May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:04:24 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:07:18 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:13:04 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:22:04 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:22:26 alp kernel: Non critical power or cooling issue cleared
> > May 25 04:22:36 alp kernel: Non critical power or cooling issue cleared
> 
> Looking at the time stamps those are actually all fairly far apart in time,
> aren't they? So do we actually see them within a short duration in practice?

Thanks for the review. Agreed, I should have phrased it better. My intent was to
say that these warnings keep flooding the kernel log over a period of time.

[..]
> >  	case EPOW_WARN_POWER:
> > -		pr_err("Non critical power issue reported by firmware");
> > -		pr_err("Check RTAS error log for details");
> > +		pr_err_ratelimited("Non critical power issue reported by firmware");
> > +		pr_err_ratelimited("Check RTAS error log for details");
> >  		break;
> 
> Those last two could be collapsed onto one line which would reduce the spam.

Yes, it could reduce the number of lines printed. Will resend the patch with the
changes.

Thanks,
Kamalesh.
Michael Ellerman June 2, 2015, 7:01 a.m. UTC | #3
On Tue, 2015-06-02 at 10:33 +0530, Kamalesh Babulal wrote:
> * Michael Ellerman <mpe@ellerman.id.au> [2015-06-01 21:26:51]:
> 
> > On Thu, 2015-05-28 at 10:03 +0530, Kamalesh Babulal wrote:
> > > We print the respective warning after parsing EPOW interrupts,
> > > prompting user to take action depending upon the severity of the
> > > event.
> > > 
> > > Some times same EPOW event warning, such as below could flood kernel
> > > log, within very short duration. So Limit the message by using
> > > ratelimit variant of pr_err.
> > > 
> > > May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:04:24 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:07:18 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:13:04 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:22:04 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:22:26 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:22:36 alp kernel: Non critical power or cooling issue cleared
> > 
> > Looking at the time stamps those are actually all fairly far apart in time,
> > aren't they? So do we actually see them within a short duration in practice?
> 
> Thanks for the review. Agree, I should have phrased it better. My intend was to
> say, that these warnings keep flooding the kernel log, over a period of time.

OK. By default printk_ratelimited() allows up to 10 messages in five seconds,
so it won't reduce the number of messages in the above example.

But I'm still OK with a patch to ratelimit them.
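
To illustrate the point about the default limits, here is a minimal userspace sketch of a fixed-window limiter with a burst of 10 messages per 5 seconds, the defaults quoted above. It is a simplified model and not the kernel's implementation: struct ratelimit, ratelimit_allow(), count_allowed() and sample_log_allowed() are hypothetical names, and the timestamps are the ones from the log in the commit message, converted to seconds since midnight.

```c
#include <stdbool.h>
#include <stddef.h>

/* Convert a wall-clock time to seconds since midnight. */
#define HMS(h, m, s) ((h) * 3600L + (m) * 60L + (s))

/* Simplified model of a fixed-window rate limiter (hypothetical, userspace only). */
struct ratelimit {
	long interval;	/* window length in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window, -1 before first use */
	int printed;	/* messages let through in the current window */
	int missed;	/* messages suppressed so far */
};

/* Return true if a message arriving at time 'now' may be printed. */
bool ratelimit_allow(struct ratelimit *rs, long now)
{
	/* Open a new window on first use or once the old one has expired. */
	if (rs->begin < 0 || now - rs->begin > rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}

/* Count how many of the n timestamps would get through the limiter. */
int count_allowed(const long *stamps, size_t n, long interval, int burst)
{
	struct ratelimit rs = { interval, burst, -1, 0, 0 };
	int allowed = 0;

	for (size_t i = 0; i < n; i++)
		if (ratelimit_allow(&rs, stamps[i]))
			allowed++;
	return allowed;
}

/* The thirteen timestamps from the log above, run with 10 messages per 5 s. */
int sample_log_allowed(void)
{
	static const long stamps[] = {
		HMS(3, 46, 34), HMS(3, 46, 52), HMS(3, 53, 48), HMS(3, 55, 46),
		HMS(3, 56, 34), HMS(3, 59, 4),  HMS(4, 2, 1),   HMS(4, 4, 24),
		HMS(4, 7, 18),  HMS(4, 13, 4),  HMS(4, 22, 4),  HMS(4, 22, 26),
		HMS(4, 22, 36),
	};

	return count_allowed(stamps, sizeof(stamps) / sizeof(stamps[0]), 5, 10);
}
```

The closest pair of messages in the log is 10 seconds apart, so every message opens a fresh window and sample_log_allowed() returns 13: nothing is suppressed. Only a genuine burst, say 13 messages within the same second, would be trimmed to 10 by this model.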

> [..]
> > >  	case EPOW_WARN_POWER:
> > > -		pr_err("Non critical power issue reported by firmware");
> > > -		pr_err("Check RTAS error log for details");
> > > +		pr_err_ratelimited("Non critical power issue reported by firmware");
> > > +		pr_err_ratelimited("Check RTAS error log for details");
> > >  		break;
> > 
> > Those last two could be collapsed onto one line which would reduce the spam.
> 
> Yes, it could reduce the number of lines printed. Will resend the patch with the
> changes.

Thanks.

cheers

Patch

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 02e4a17..2556bc2 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -145,17 +145,17 @@  static void rtas_parse_epow_errlog(struct rtas_error_log *log)
 
 	switch (action_code) {
 	case EPOW_RESET:
-		pr_err("Non critical power or cooling issue cleared");
+		pr_err_ratelimited("Non critical power or cooling issue cleared");
 		break;
 
 	case EPOW_WARN_COOLING:
-		pr_err("Non critical cooling issue reported by firmware");
-		pr_err("Check RTAS error log for details");
+		pr_err_ratelimited("Non critical cooling issue reported by firmware");
+		pr_err_ratelimited("Check RTAS error log for details");
 		break;
 
 	case EPOW_WARN_POWER:
-		pr_err("Non critical power issue reported by firmware");
-		pr_err("Check RTAS error log for details");
+		pr_err_ratelimited("Non critical power issue reported by firmware");
+		pr_err_ratelimited("Check RTAS error log for details");
 		break;
 
 	case EPOW_SYSTEM_SHUTDOWN:
@@ -177,7 +177,7 @@  static void rtas_parse_epow_errlog(struct rtas_error_log *log)
 		break;
 
 	default:
-		pr_err("Unknown power/cooling event (action code %d)",
+		pr_err_ratelimited("Unknown power/cooling event (action code %d)",
 			action_code);
 	}
 }