[v9,00/10] Extend regulator notification support

Message ID cover.1620645507.git.matti.vaittinen@fi.rohmeurope.com

Matti Vaittinen May 10, 2021, 11:26 a.m. UTC
Extend regulator notification support

This series extends the regulator notification and error flag support.
Initial discussion on the topic can be found here:
https://lore.kernel.org/lkml/6046836e22b8252983f08d5621c35ececb97820d.camel@fi.rohmeurope.com/

This series is built on top of the BD9576MUF support patch series v9
which is currently in MFD tree at immutable branch ib-mfd-watchdog-5.13
https://lore.kernel.org/lkml/cover.1615219345.git.matti.vaittinen@fi.rohmeurope.com/
(The series should apply without those patches but there is compile time
dependency to definitions brought in at the last patch of the BD9576
series. This should be Ok though as there is a Kconfig dependency in
BD9576 regulator driver)

In a nutshell - the series adds:

1. WARNING level events/error flags. (Patch 3)
  Current regulator 'ERROR' event notifications for over/under
  voltage, over current and over temperature are used to indicate
  a condition where the monitored entity is so badly "off" that it
  actually indicates a hardware error which cannot be recovered. The
  most typical handling for that is believed to be a (graceful)
  system shutdown. Here we add a set of 'WARNING' level flags to allow
  sending notifications to consumers before things are 'that badly off'
  so that consumer drivers can implement recovery actions.
2. Device-tree properties for specifying limit values. (Patches 1, 5)
  Add limits for above mentioned 'ERROR' and 'WARNING' levels (which
  send notifications to consumers) and also for a 'PROTECTION' level
  (which will be used to immediately shut-down the regulator(s) W/O
  informing consumer drivers. Typically implemented by hardware).
  Property parsing is implemented in regulator core which then calls
  callback operations for limit setting from the IC drivers. A
  warning is emitted if protection is requested by device tree but the
  underlying IC does not support configuring requested protection.
3. Helpers which can be registered by IC. (Patch 4)
  The target is to avoid implementing IRQ handling and IRQ storm protection
  in each IC driver. (Many of the ICs implementing these IRQs do not allow
  masking or acking the IRQ but keep the IRQ asserted for the whole
  duration of the problem, keeping the processor in an IRQ handling loop).
4. Emergency poweroff function (refactored out of the thermal_core to
  kernel/reboot.c) which is called if IC fires error IRQs but IC reading
  fails and given retry-count is exceeded. (Patches 2, 4)
  Please note that the mutex in the emergency shutdown was replaced by a
  simple atomic in order to allow call from any context.

The helper is written so that it can be used to implement roughly the
same logic as is used in the qcom-labibb regulator. This means, amongst
other things, a safety shut-down if IC registers are not readable.
Using these shut-down retry counters is optional. The idea is that the
helper can also be used by simpler ICs which do not provide status
register(s) for checking whether the error is still active.

ICs which do not have such a status register can simply omit the 'renable'
callback (and retry-counts etc) - the helper then assumes the situation is
Ok and re-enables the IRQ after a given time period. If the problem
persists, the handler is run again and another notification is sent - but
at least the delay allows the processor to avoid an IRQ loop.
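
The decision flow described above can be sketched in plain userspace C as
follows. This is only a model of the logic, not the helper from the
series: struct irq_helper_sketch, retry_limit, fail_count and the two
example callbacks are assumptions made for illustration. Only the ideas
of a map_event callback, an optional 'renable' callback and a retry
counter come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum irq_action_sketch {
	IRQ_NOTIFY_AND_REARM,	/* status read OK: notify, re-enable IRQ after delay */
	IRQ_RETRY_LATER,	/* status read failed: keep IRQ off, try again later */
	IRQ_FORCE_POWEROFF,	/* too many failed reads: protect the hardware */
};

struct irq_helper_sketch {
	int (*map_event)(void *ic);	/* 0 on success, negative on read failure */
	bool (*renable)(void *ic);	/* optional: true when the error has cleared */
	int retry_limit;		/* allowed consecutive failures, 0 = no limit */
	int fail_count;			/* consecutive failures seen so far */
	void *ic;
};

/* Runs with the IRQ already disabled, so a stuck line cannot storm. */
static enum irq_action_sketch handle_irq_sketch(struct irq_helper_sketch *h)
{
	assert(h && h->map_event);

	if (h->map_event(h->ic) < 0) {
		h->fail_count++;
		if (h->retry_limit && h->fail_count > h->retry_limit)
			return IRQ_FORCE_POWEROFF;
		return IRQ_RETRY_LATER;
	}

	h->fail_count = 0;
	return IRQ_NOTIFY_AND_REARM;
}

/* Called after the delay to decide whether the IRQ may be re-enabled. */
static bool may_reenable_sketch(struct irq_helper_sketch *h)
{
	if (!h->renable)
		return true;	/* no status register: assume things are Ok */
	return h->renable(h->ic);
}

/* Example callbacks used to exercise the sketch. */
static int always_fails_sketch(void *ic) { (void)ic; return -1; }
static int always_ok_sketch(void *ic) { (void)ic; return 0; }
```

With retry_limit set to 2 and a map_event that always fails, the first
two calls ask for a retry and the third requests the forced power-off.
An IC without a 'renable' callback is always considered safe to re-arm
once the delay has passed.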

Patch 7 takes this notification support into use in the BD9576MUF driver.
Patch 8 is an MFD-related change which is not really related to the RFC
here. It was added to this series in order to avoid potential conflicts.
Patch 9 adds a maintainers entry.

Changelog v9:
   - rebases on v5.13-rc1
   - Update thermal documentation
   - Fix regulator notification event number
Changelog v8:
   - split adding the shutdown API and taking it into use in thermal
     core into separate patches.
   - replace the spinlock with atomic when ensuring the emergency
     shutdown is only called once.
Changelog v7:
  general:
   - rebased on v5.12-rc7
   - new patch for refactoring the hw-failure reboot logic out of
     thermal_core.c for others to use.
  notification helpers:
   - fix regulator error_flags query
   - grammar/typos
   - do not BUG() but attempt to shut-down the system
   - use BITS_PER_TYPE()

Changelog v6:
  Add MAINTAINERS entry
  Changes to IRQ notifiers
   - move devm functions to drivers/regulator/devres.c
   - drop irq validity check
   - use devm_add_action_or_reset()
   - fix styling issues
   - fix kerneldocs

Changelog v5:
   - Fix the badly formatted pr_emerg() call.

Changelog v4:
   - rebased on v5.12-rc6
   - dropped RFC
   - fix external FET DT-binding.
   - improve prints for cases when expecting HW failure.
   - styling and typos

Changelog v3:
  Regulator core:
   - Fix dangling pointer access at regulator_irq_helper()
  stpmic1_regulator:
   - fix function prototype (compile error)
  bd9576-regulator:
   - Update over current limits to what was given in new data-sheet
     (REV00K)
   - Allow over-current monitoring without external FET. Set limits to
     values given in data-sheet (REV00K).

Changelog v2:
  Generic:
  - rebase on v5.12-rc2 + BD9576 series
  - Split devm variant of delayed wq to own series
  Regulator framework:
  - Provide non devm variant of IRQ notification helpers
  - shorten dt-property names as suggested by Rob
  - unconditionally call map_event in IRQ handling and require it to be
    populated
  BD9576 regulators:
  - change the FET resistance property to micro-ohms
  - fix voltage computation in OC limit setting

--

Matti Vaittinen (10):
  dt_bindings: Add protection limit properties
  reboot: Add hardware protection power-off
  thermal: Use generic HW-protection shutdown API
  regulator: add warning flags
  regulator: IRQ based event/error notification helpers
  regulator: add property parsing and callbacks to set protection limits
  dt-bindings: regulator: bd9576 add FET ON-resistance for OCW
  regulator: bd9576: Support error reporting
  regulator: bd9576: Fix the driver name in id table
  MAINTAINERS: Add reviewer for regulator irq_helpers

 .../bindings/regulator/regulator.yaml         |   82 ++
 .../regulator/rohm,bd9576-regulator.yaml      |    6 +
 .../driver-api/thermal/sysfs-api.rst          |   24 +-
 MAINTAINERS                                   |    4 +
 drivers/regulator/Makefile                    |    2 +-
 drivers/regulator/bd9576-regulator.c          | 1054 +++++++++++++++--
 drivers/regulator/core.c                      |  151 ++-
 drivers/regulator/devres.c                    |   52 +
 drivers/regulator/irq_helpers.c               |  398 +++++++
 drivers/regulator/of_regulator.c              |   58 +
 drivers/regulator/qcom-labibb-regulator.c     |   10 +-
 drivers/regulator/qcom_spmi-regulator.c       |    6 +-
 drivers/regulator/stpmic1_regulator.c         |   20 +-
 drivers/thermal/thermal_core.c                |   63 +-
 include/linux/reboot.h                        |    1 +
 include/linux/regulator/consumer.h            |   14 +
 include/linux/regulator/driver.h              |  176 ++-
 include/linux/regulator/machine.h             |   26 +
 kernel/reboot.c                               |   80 ++
 19 files changed, 2010 insertions(+), 217 deletions(-)
 create mode 100644 drivers/regulator/irq_helpers.c


base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5

Comments

Petr Mladek May 12, 2021, 8:20 a.m. UTC | #1
On Mon 2021-05-10 14:28:30, Matti Vaittinen wrote:
> There can be a few cases when we need to shut-down the system in order to
> protect the hardware. Currently this is done at least by the thermal core
> when temperature rises over certain limit.
> 
> Some PMICs can also generate interrupts for example for over-current or
> over-voltage, voltage drops, short-circuit, ... etc. On some systems
> these are a sign of hardware failure and only thing to do is try to
> protect the rest of the hardware by shutting down the system.
> 
> Add shut-down logic which can be used by all subsystems instead of
> implementing the shutdown in each subsystem. The logic is stolen from
> thermal_core with difference of using atomic_t instead of a mutex in
> order to allow calls directly from IRQ context.
> 
> Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
> 
> diff --git a/kernel/reboot.c b/kernel/reboot.c
> index a6ad5eb2fa73..5da8c80a2647 100644
> --- a/kernel/reboot.c
> +++ b/kernel/reboot.c
> @@ -518,6 +519,85 @@ void orderly_reboot(void)
>  }
>  EXPORT_SYMBOL_GPL(orderly_reboot);
>  
> +/**
> + * hw_failure_emergency_poweroff_func - emergency poweroff work after a known delay
> + * @work: work_struct associated with the emergency poweroff function
> + *
> + * This function is called in very critical situations to force
> + * a kernel poweroff after a configurable timeout value.
> + */
> +static void hw_failure_emergency_poweroff_func(struct work_struct *work)
> +{
> +	/*
> +	 * We have reached here after the emergency shutdown waiting period has
> +	 * expired. This means orderly_poweroff has not been able to shut off
> +	 * the system for some reason.
> +	 *
> +	 * Try to shut down the system immediately using kernel_power_off
> +	 * if populated
> +	 */
> +	WARN(1, "Hardware protection timed-out. Trying forced poweroff\n");
> +	kernel_power_off();

WARN() looks like overkill here. It prints many lines that are not
very useful in this case. The function is called from a well-known
context (workqueue worker).

Also be aware that "panic_on_warn" commandline option will trigger
panic() here.


> +	/*
> +	 * Worst of the worst case trigger emergency restart
> +	 */
> +	WARN(1,
> +	     "Hardware protection shutdown failed. Trying emergency restart\n");
> +	emergency_restart();

Two consecutive WARN() calls are even less useful. They are eye
catching but it is hard to find the only useful line with
the custom message.

Best Regards,
Petr

Matti Vaittinen May 12, 2021, noon UTC | #2
Hi Petr,

Thanks for the review!

On Wed, 2021-05-12 at 10:20 +0200, Petr Mladek wrote:
> On Mon 2021-05-10 14:28:30, Matti Vaittinen wrote:
> > There can be a few cases when we need to shut-down the system in order to
> > protect the hardware. Currently this is done at least by the thermal core
> > when temperature rises over certain limit.
> > 
> > Some PMICs can also generate interrupts for example for over-current or
> > over-voltage, voltage drops, short-circuit, ... etc. On some systems
> > these are a sign of hardware failure and only thing to do is try to
> > protect the rest of the hardware by shutting down the system.
> > 
> > Add shut-down logic which can be used by all subsystems instead of
> > implementing the shutdown in each subsystem. The logic is stolen from
> > thermal_core with difference of using atomic_t instead of a mutex in
> > order to allow calls directly from IRQ context.
> > 
> > Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
> > 
> > diff --git a/kernel/reboot.c b/kernel/reboot.c
> > index a6ad5eb2fa73..5da8c80a2647 100644
> > --- a/kernel/reboot.c
> > +++ b/kernel/reboot.c
> > @@ -518,6 +519,85 @@ void orderly_reboot(void)
> >  }
> >  EXPORT_SYMBOL_GPL(orderly_reboot);
> >  
> > +/**
> > + * hw_failure_emergency_poweroff_func - emergency poweroff work after a known delay
> > + * @work: work_struct associated with the emergency poweroff function
> > + *
> > + * This function is called in very critical situations to force
> > + * a kernel poweroff after a configurable timeout value.
> > + */
> > +static void hw_failure_emergency_poweroff_func(struct work_struct *work)
> > +{
> > +	/*
> > +	 * We have reached here after the emergency shutdown waiting period has
> > +	 * expired. This means orderly_poweroff has not been able to shut off
> > +	 * the system for some reason.
> > +	 *
> > +	 * Try to shut down the system immediately using kernel_power_off
> > +	 * if populated
> > +	 */
> > +	WARN(1, "Hardware protection timed-out. Trying forced poweroff\n");
> > +	kernel_power_off();
> 
> WARN() looks like overkill here. It prints many lines that are not
> very useful in this case. The function is called from a well-known
> context (workqueue worker).

This was the existing code which I stole from the thermal_core. I kind
of think that eye-catching WARN is actually a good choice here. Doing
autonomous power-off without a WARNing does not sound good to me :)

> Also be aware that "panic_on_warn" commandline option will trigger
> panic() here.

Hmm.. If panic() hangs the system that might indeed be a problem. Now
we are (again) on a territory which I don't know well. I'd appreciate
any input from thermal folks and Mark. I don't like the idea of making
extreme things like power-off w/o well visible log-trace. Thus I would
like to have WARN()-like eye-catcher, even if the call-trace was not
too varying. It will at least point to this worker. Any better
suggestions than WARN()?

> 
> > +	/*
> > +	 * Worst of the worst case trigger emergency restart
> > +	 */
> > +	WARN(1,
> > +	     "Hardware protection shutdown failed. Trying emergency restart\n");
> > +	emergency_restart();
> 
> Two consecutive WARN() calls are even less useful. They are eye
> catching but it is hard to find the only useful line with
> the custom message.

I think you are right. One WARN should be enough to point here. This
last one could be just an additional print.
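
The idea of keeping one eye-catcher and turning the second message into
a plain print can be sketched in userspace C as below. This is only an
illustration of the escalation order, not the code that ended up in the
series: loud_warn_sketch() stands in for a WARN()-style report,
plain_emerg_sketch() for a single-line emergency print, and the counters
and the failing power-off callback are invented for the example.

```c
#include <assert.h>
#include <stdio.h>

static int loud_warnings;	/* WARN()-style reports emitted */
static int plain_prints;	/* single-line emergency prints emitted */
static int restarts;		/* simulated emergency restarts */

static void loud_warn_sketch(const char *msg)
{
	loud_warnings++;
	fprintf(stderr, "WARNING (with backtrace in a real kernel): %s", msg);
}

static void plain_emerg_sketch(const char *msg)
{
	plain_prints++;
	fprintf(stderr, "%s", msg);
}

/*
 * power_off() returns 0 when the (simulated) power-off succeeded. In a
 * real kernel a successful power-off would simply never return.
 */
static void emergency_poweroff_sketch(int (*power_off)(void))
{
	loud_warn_sketch("Hardware protection timed-out. Trying forced poweroff\n");
	if (power_off() == 0)
		return;

	/* Second message is a plain print, so the log stays readable. */
	plain_emerg_sketch("Hardware protection shutdown failed. Trying emergency restart\n");
	restarts++;
}

/* Example callback: a power-off attempt that fails. */
static int failing_power_off_sketch(void)
{
	return -1;
}
```

When the power-off attempt fails, the log carries one loud report
followed by one plain line before the restart is attempted.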

Best Regards
	--Matti Vaittinen

Petr Mladek May 13, 2021, 8:34 a.m. UTC | #3
On Wed 2021-05-12 12:00:46, Vaittinen, Matti wrote:
> On Wed, 2021-05-12 at 10:20 +0200, Petr Mladek wrote:
> > WARN() looks like overkill here. It prints many lines that are not
> > very useful in this case. The function is called from a well-known
> > context (workqueue worker).
> 
> This was the existing code which I stole from the thermal_core. I kind
> of think that eye-catching WARN is actually a good choice here. Doing
> autonomous power-off without a WARNing does not sound good to me :)
> 
> > Also be aware that "panic_on_warn" commandline option will trigger
> > panic() here.
> 
> Hmm.. If panic() hangs the system that might indeed be a problem. Now
> we are (again) on a territory which I don't know well. I'd appreciate
> any input from thermal folks and Mark. I don't like the idea of making
> extreme things like power-off w/o well visible log-trace. Thus I would
> like to have WARN()-like eye-catcher, even if the call-trace was not
> too varying. It will at least point to this worker. Any better
> suggestions than WARN()?

Heh, it might make sense to create a system wide API for these. I am
sure that WARN() is mis-used this way in many other places.

There already are two locations that use another eye-catching text.
A common API might help to avoid duplication of the common parts,
see
https://lore.kernel.org/lkml/20210305194206.3165917-2-elver@google.com/

Well, it might be out of scope for this patchset.

Best Regards,
Petr

Matti Vaittinen May 17, 2021, 4:57 a.m. UTC | #4
On Thu, 2021-05-13 at 10:34 +0200, Petr Mladek wrote:
> On Wed 2021-05-12 12:00:46, Vaittinen, Matti wrote:
> > On Wed, 2021-05-12 at 10:20 +0200, Petr Mladek wrote:
> > > WARN() looks like overkill here. It prints many lines that are not
> > > very useful in this case. The function is called from a well-known
> > > context (workqueue worker).
> > 
> > This was the existing code which I stole from the thermal_core. I kind
> > of think that eye-catching WARN is actually a good choice here. Doing
> > autonomous power-off without a WARNing does not sound good to me :)
> > 
> > > Also be aware that "panic_on_warn" commandline option will trigger
> > > panic() here.
> > 
> > Hmm.. If panic() hangs the system that might indeed be a problem. Now
> > we are (again) on a territory which I don't know well. I'd appreciate
> > any input from thermal folks and Mark. I don't like the idea of making
> > extreme things like power-off w/o well visible log-trace. Thus I would
> > like to have WARN()-like eye-catcher, even if the call-trace was not
> > too varying. It will at least point to this worker. Any better
> > suggestions than WARN()?
> 
> Heh, it might make sense to create a system wide API for these. I am
> sure that WARN() is mis-used this way on many other locations.
> 
> There already are two locations that use another eye-catching text.
> A common API might help to avoid duplication of the common parts,
> see
> https://lore.kernel.org/lkml/20210305194206.3165917-2-elver@google.com/
> 
> Well, it might be out of scope for this patchset.

I just had a very brief "chat" with Geert (3 IRC messages, posted
during 4 or 5 days :]) - and Geert pointed me to this:

https://lore.kernel.org/linux-iommu/20210331093104.383705-4-geert+renesas@glider.be/

So, maybe I'll just go with simple pr_emerg() and trust that the
emerg() print should catch attention as such level print probably
should. I'll respin the patch series (probably tomorrow) - let's see
what thermal and regulator folks say :)

Thanks for all the help so far!

Best Regards
	Matti Vaittinen