Message ID | 20150714075153.GA32237@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Michael Ellerman |
Headers | show |
Patch looks good to me. A small nit pick below. On 07/14/2015 01:21 PM, Kamalesh Babulal wrote: > * Vipin K Parashar <vipin@linux.vnet.ibm.com> [2015-06-25 00:48:20]: > >> On 06/02/2015 10:48 AM, Kamalesh Babulal wrote: >>> We print the respective warning after parsing EPOW interrupts, >>> prompting user to take action depending upon the severity of the >>> event. >>> >>> Some times same EPOW event warning, such as below could flood kernel >>> log, over a period of time. So Limit the warnings by using ratelimit >>> variant of pr_err. Also, merge adjacent pr_err/pr_emerg into single >>> one to reduce the number of lines printed per warning. >>> >>> May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared >>> May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared >>> May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared >>> May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared >>> May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared >>> May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:04:24 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:07:18 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:13:04 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:22:04 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:22:26 alp kernel: Non critical power or cooling issue cleared >>> May 25 04:22:36 alp kernel: Non critical power or cooling issue cleared >> These messages are minutes apart and thus rate limiting won't help. >> One solution could be to use a flag based approach. Set a flag once a >> EPOW condition is detected and check that flag upon receiving EPOW_RESET. >> EPOW condition clear message should be logged only if a EPOW was previously >> detected i.e. flag found set. > Thanks for reviewing it. Sorry for late response. > > bool flag epow_state, which is initialized to false and when any event gets > reported, the flag set to true once the event gets acknowledged by a reset. > As, seen in the example of flooded messages occurring only with reset event. > The reset action is guarded with bool flag (set only if there was event > reported previously) and ignore multiple resets, without real EPOW event. > > I have only compile tested the patch. If this approach sounds good. > I will resend formal patch. > > > diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c > index 02e4a17..4819b1d 100644 > --- a/arch/powerpc/platforms/pseries/ras.c > +++ b/arch/powerpc/platforms/pseries/ras.c > @@ -40,6 +40,8 @@ static int ras_check_exception_token; > #define EPOW_SENSOR_TOKEN 9 > #define EPOW_SENSOR_INDEX 0 > > +static bool epow_state = false; > + Explicit declaration isn't needed. default value would be false already. A one line comment about flag usage would be good. > static irqreturn_t ras_epow_interrupt(int irq, void *dev_id); > static irqreturn_t ras_error_interrupt(int irq, void *dev_id); > > @@ -145,21 +147,27 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log) > > switch (action_code) { > case EPOW_RESET: > - pr_err("Non critical power or cooling issue cleared"); > + if (epow_state) { > + pr_err("Non critical power or cooling issue cleared"); > + epow_state = false; > + } > break; > > case EPOW_WARN_COOLING: > - pr_err("Non critical cooling issue reported by firmware"); > - pr_err("Check RTAS error log for details"); > + pr_err("Non critical cooling issue reported by firmware, " > + "Check RTAS error log for details"); > + epow_state = true; > break; > > case EPOW_WARN_POWER: > - pr_err("Non critical power issue reported by firmware"); > - pr_err("Check RTAS error log for details"); > + pr_err("Non critical power issue reported by firmware, " > + "Check RTAS error log for details"); > + epow_state = true; > break; > > case EPOW_SYSTEM_SHUTDOWN: > handle_system_shutdown(epow_log->event_modifier); > + epow_state = true; > break; > > case EPOW_SYSTEM_HALT: > @@ -169,9 +177,8 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log) > > case EPOW_MAIN_ENCLOSURE: > case EPOW_POWER_OFF: > - pr_emerg("Critical power/cooling issue reported by firmware"); > - pr_emerg("Check RTAS error log for details"); > - pr_emerg("Immediate power off"); > + pr_emerg("Critical power/cooling issue reported by firmware, " > + "Check RTAS error log for details. Immediate power off."); > emergency_sync(); > kernel_power_off(); > break; > @@ -179,6 +186,7 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log) > default: > pr_err("Unknown power/cooling event (action code %d)", > action_code); > + epow_state = true; > } > } >
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c index 02e4a17..4819b1d 100644 --- a/arch/powerpc/platforms/pseries/ras.c +++ b/arch/powerpc/platforms/pseries/ras.c @@ -40,6 +40,8 @@ static int ras_check_exception_token; #define EPOW_SENSOR_TOKEN 9 #define EPOW_SENSOR_INDEX 0 +static bool epow_state = false; + static irqreturn_t ras_epow_interrupt(int irq, void *dev_id); static irqreturn_t ras_error_interrupt(int irq, void *dev_id); @@ -145,21 +147,27 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log) switch (action_code) { case EPOW_RESET: - pr_err("Non critical power or cooling issue cleared"); + if (epow_state) { + pr_err("Non critical power or cooling issue cleared"); + epow_state = false; + } break; case EPOW_WARN_COOLING: - pr_err("Non critical cooling issue reported by firmware"); - pr_err("Check RTAS error log for details"); + pr_err("Non critical cooling issue reported by firmware, " + "Check RTAS error log for details"); + epow_state = true; break; case EPOW_WARN_POWER: - pr_err("Non critical power issue reported by firmware"); - pr_err("Check RTAS error log for details"); + pr_err("Non critical power issue reported by firmware, " + "Check RTAS error log for details"); + epow_state = true; break; case EPOW_SYSTEM_SHUTDOWN: handle_system_shutdown(epow_log->event_modifier); + epow_state = true; break; case EPOW_SYSTEM_HALT: @@ -169,9 +177,8 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log) case EPOW_MAIN_ENCLOSURE: case EPOW_POWER_OFF: - pr_emerg("Critical power/cooling issue reported by firmware"); - pr_emerg("Check RTAS error log for details"); - pr_emerg("Immediate power off"); + pr_emerg("Critical power/cooling issue reported by firmware, " + "Check RTAS error log for details. Immediate power off."); emergency_sync(); kernel_power_off(); break; @@ -179,6 +186,7 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log) default: pr_err("Unknown power/cooling event (action code %d)", action_code); + epow_state = true; } }