[v4,0/7] Extend regulator notification support

Message ID cover.1617690965.git.matti.vaittinen@fi.rohmeurope.com

Message

Matti Vaittinen April 6, 2021, 7:12 a.m. UTC
Extend regulator notification support

This series extends the regulator notification and error flag support. Initial
discussion on the topic can be found here:
https://lore.kernel.org/lkml/6046836e22b8252983f08d5621c35ececb97820d.camel@fi.rohmeurope.com/

This series is built on top of the BD9576MUF support patch series v9
which is currently in MFD tree at immutable branch ib-mfd-watchdog-5.13
https://lore.kernel.org/lkml/cover.1615219345.git.matti.vaittinen@fi.rohmeurope.com/
(The series should apply without those patches, but there is a compile-time
dependency on definitions brought in by the last patch of the BD9576
series. This should be OK though, as there is a Kconfig dependency in
the BD9576 regulator driver.)

In a nutshell - the series adds:

1. WARNING level events/error flags. (Patch 2)
  Current regulator 'ERROR' event notifications for over/under
  voltage, over current and over temperature are used to indicate
  a condition where the monitored entity is so badly "off" that it actually
  indicates a hardware error which cannot be recovered. The most
  typical handling for that is believed to be a (graceful)
  system shut-down. Here we add a set of 'WARNING' level flags to allow
  sending notifications to consumers before things are 'that badly off'
  so that consumer drivers can implement recovery actions.
2. Device-tree properties for specifying limit values. (Patches 1, 4)
  Add limits for the above-mentioned 'ERROR' and 'WARNING' levels (which
  send notifications to consumers) and also for a 'PROTECTION' level
  (which will be used to immediately shut down the regulator(s) without
  informing consumer drivers; typically implemented by hardware).
  Property parsing is implemented in the regulator core, which then calls
  callback operations for limit setting from the IC drivers. A
  warning is emitted if protection is requested by device tree but the
  underlying IC does not support configuring the requested protection.
3. Helpers which can be registered by IC. (Patch 3)
  The target is to avoid implementing IRQ handling and IRQ storm protection
  in each IC driver. (Many of the ICs implementing these IRQs do not allow
  masking or acking the IRQ but keep the IRQ asserted for the whole
  duration of the problem, keeping the processor in an IRQ handling loop.)
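
The three severity levels from items 1 and 2 can be pictured with a small
user-space sketch. The struct limits type, its field names and
classify_voltage() are hypothetical names picked for illustration only, not
the definitions added by this series. The sketch assumes over-voltage limits
in microvolts, with 0 meaning "limit not set".

```c
#include <assert.h>

/* Hypothetical severity levels mirroring WARNING / ERROR / PROTECTION. */
enum severity { SEV_OK, SEV_WARNING, SEV_ERROR, SEV_PROTECTION };

/* Hypothetical over-voltage limits in microvolts; 0 means "not set". */
struct limits {
	int warn_uv;
	int err_uv;
	int prot_uv;
};

/* Return the most severe level whose limit the measured value reaches. */
static enum severity classify_voltage(const struct limits *l, int measured_uv)
{
	assert(l);
	if (l->prot_uv && measured_uv >= l->prot_uv)
		return SEV_PROTECTION;	/* hardware would cut the output */
	if (l->err_uv && measured_uv >= l->err_uv)
		return SEV_ERROR;	/* unrecoverable, notify consumers */
	if (l->warn_uv && measured_uv >= l->warn_uv)
		return SEV_WARNING;	/* consumers may still recover */
	return SEV_OK;
}
```

Checking the most severe limit first means a value past several limits is
reported only at its highest level.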

The helper was written so that it could be used to implement
roughly the same logic as is used in the qcom-labibb regulator. This means,
amongst other things, a safety shut-down if IC registers are not readable.
Using these shut-down retry counters is optional. The idea is that the
helper could also be used by simpler ICs which do not provide status
register(s) which can be used to check if the error is still active.

ICs which do not have such a status register can simply omit the 'renable'
callback (and retry counts etc.), and the helper assumes the situation is OK
and re-enables the IRQ after a given time period. If the problem persists, the
handler is run again and another notification is sent, but at least the
delay allows the processor to avoid an IRQ loop.
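
The flow described in the two paragraphs above can be sketched as a
user-space state machine. This is a simplified illustration under stated
assumptions, not the helper from patch 3: struct irq_helper_sketch, on_irq(),
on_delay_elapsed() and status_read_fails() are hypothetical names, and the
sketch folds "status not readable" and "error still active" into a single
failure path, which the real helper may well treat differently.

```c
#include <assert.h>
#include <stdbool.h>

struct irq_helper_sketch {
	/* Optional: returns 0 when the IC status says the error is gone. */
	int (*renable)(void *ctx);
	void *ctx;
	int max_retries;	/* failed checks tolerated before giving up */
	int retries;
	bool irq_enabled;
	bool gave_up;		/* stands in for the safety shut-down */
	int notifications;
};

/* IRQ fired: notify consumers and keep the line disabled for a while. */
static void on_irq(struct irq_helper_sketch *h)
{
	assert(h);
	h->notifications++;
	h->irq_enabled = false;	/* avoids looping on a still-asserted IRQ */
}

/* Runs once the IRQ-off period has elapsed. */
static void on_delay_elapsed(struct irq_helper_sketch *h)
{
	assert(h);
	if (!h->renable) {
		/* No status register: assume all is well and re-enable. */
		h->irq_enabled = true;
		return;
	}
	if (h->renable(h->ctx) == 0) {
		h->retries = 0;
		h->irq_enabled = true;
		return;
	}
	/* Could not confirm recovery: count it and possibly give up. */
	if (++h->retries > h->max_retries)
		h->gave_up = true;
}

/* Example callback modelling an IC whose status register cannot be read. */
static int status_read_fails(void *ctx)
{
	(void)ctx;
	return -1;
}
```

With no renable callback the IRQ comes back after one delay; with
status_read_fails() and max_retries set to 2 the sketch gives up on the third
failed check.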

Patch 6 puts this notification support to use in the BD9576MUF driver.
Patch 7 is an MFD-related change which is not really related to this
series. It was added here in order to avoid potential conflicts.

Changelog v4:
   - rebased on v5.12-rc6
   - dropped RFC
   - fix external FET DT-binding.
   - improve prints for cases when expecting HW failure.
   - styling and typos
Changelog v3:
  Regulator core:
   - Fix dangling pointer access at regulator_irq_helper()
  stpmic1_regulator:
   - fix function prototype (compile error)
  bd9576-regulator:
   - Update over current limits to what was given in new data-sheet
     (REV00K)
   - Allow over-current monitoring without external FET. Set limits to
     values given in data-sheet (REV00K).

Changelog v2:
  Generic:
  - rebase on v5.12-rc2 + BD9576 series
  - Split devm variant of delayed wq to own series
  Regulator framework:
  - Provide non devm variant of IRQ notification helpers
  - shorten dt-property names as suggested by Rob
  - unconditionally call map_event in IRQ handling and require it to be
    populated
  BD9576 regulators:
  - change the FET resistance property to micro-ohms
  - fix voltage computation in OC limit setting

--

Matti Vaittinen (7):
  dt_bindings: Add protection limit properties
  regulator: add warning flags
  regulator: IRQ based event/error notification helpers
  regulator: add property parsing and callbacks to set protection limits
  dt-bindings: regulator: bd9576 add FET ON-resistance for OCW
  regulator: bd9576: Support error reporting
  regulator: bd9576: Fix the driver name in id table

 .../bindings/regulator/regulator.yaml         |   82 ++
 .../regulator/rohm,bd9576-regulator.yaml      |    6 +
 drivers/regulator/Makefile                    |    2 +-
 drivers/regulator/bd9576-regulator.c          | 1060 +++++++++++++++--
 drivers/regulator/core.c                      |  146 ++-
 drivers/regulator/irq_helpers.c               |  431 +++++++
 drivers/regulator/of_regulator.c              |   58 +
 drivers/regulator/qcom-labibb-regulator.c     |   10 +-
 drivers/regulator/qcom_spmi-regulator.c       |    6 +-
 drivers/regulator/stpmic1_regulator.c         |   20 +-
 include/linux/regulator/consumer.h            |   14 +
 include/linux/regulator/driver.h              |  176 ++-
 include/linux/regulator/machine.h             |   26 +
 13 files changed, 1895 insertions(+), 142 deletions(-)
 create mode 100644 drivers/regulator/irq_helpers.c

Comments

Matti Vaittinen April 7, 2021, 5:02 a.m. UTC | #1
Morning Andy,

Thanks for the review! By the way, is it me or did your mail-client
spill this out using HTML?

On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> On Tuesday, April 6, 2021, Matti Vaittinen <
> matti.vaittinen@fi.rohmeurope.com> wrote:

> > +static void die_loudly(const char *msg)
> > +{
> > +       pr_emerg(msg);
> 
> Oh là là, besides build bot complaints, this has serious security
> implications. Never do like this.
 
I'm not even trying to claim that was correct. And I did send a fixup -
sorry for this. I don't intend to do this again.

Now, when this is said - If you have a minute, please educate me.
Assuming we know all the callers and that all the callers use this as

die_loudly("foobarfoo\n");
- what is the exploit mechanism?

> > +       BUG();
> > +}
> > +


> > +/**
> > + * regulator_irq_helper - register IRQ based regulator event/error
> > notifier
> > + *
> > + * @dev:               device to which lifetime the helper's
> > lifetime is
> > + *                     bound.
> > + * @d:                 IRQ helper descriptor.
> > + * @irq:               IRQ used to inform events/errors to be
> > notified.
> > + * @irq_flags:         Extra IRQ flags to be OR's with the default
> > IRQF_ONESHOT
> > + *                     when requesting the (threaded) irq.
> > + * @common_errs:       Errors which can be flagged by this IRQ for
> > all rdevs.
> > + *                     When IRQ is re-enabled these errors will be
> > cleared
> > + *                     from all associated regulators
> > + * @per_rdev_errs:     Optional error flag array describing errors
> > specific
> > + *                     for only some of the regulators. These
> > errors will be
> > + *                     or'ed with common erros. If this is given
> > the array
> > + *                     should contain rdev_amount flags. Can be
> > set to NULL
> > + *                     if there is no regulator specific error
> > flags for this
> > + *                     IRQ.
> > + * @rdev:              Array of regulators associated with this
> > IRQ.
> > + * @rdev_amount:       Amount of regulators associated wit this
> > IRQ.
> > + */
> > +void *regulator_irq_helper(struct device *dev,
> > +                           const struct regulator_irq_desc *d, int
> > irq,
> > +                           int irq_flags, int common_errs, int
> > *per_rdev_errs,
> > +                           struct regulator_dev **rdev, int
> > rdev_amount)
> > +{
> > +       struct regulator_irq *h;
> > +       int ret;
> > +
> > +       if (!rdev_amount || !d || !d->map_event || !d->name)
> > +               return ERR_PTR(-EINVAL);
> > +
> > +       if (irq <= 0) {
> > +               dev_err(dev, "No IRQ\n");
> > +               return ERR_PTR(-EINVAL);
> 
> Why shadowing error code? Negative IRQ is anything but “no IRQ”.

This was a good point. The irq is passed here as parameter. From this
function's perspective the negative irq is invalid parameter - we don't
know how the caller has obtained it. Print could show the value
contained in irq though.

Now that you pointed this out I am unsure if this check is needed here.
If we check it, then I still think we should report -EINVAL for invalid
parameter. Other option is to just call the request_threaded_irq() -
log the IRQ request failure and return what request_threaded_irq()
returns. Do you think that would make sense?

> > +
> > +/**
> > + * regulator_irq_helper_cancel - drop IRQ based regulator
> > event/error notifier
> > + *
> > + * @handle:            Pointer to handle returned by a successful
> > call to
> > + *                     regulator_irq_helper(). Will be NULLed upon
> > return.
> > + *
> > + * The associated IRQ is released and work is cancelled when the
> > function
> > + * returns.
> > + */
> > +void regulator_irq_helper_cancel(void **handle)
> > +{
> > +       if (handle && *handle) {
> 
> Can handle ever be NULL here ? (Yes, I understand that you export
> this)

To tell the truth - I am not sure. I *guess* that if we allow this to
be NULL, then one *could* implement a driver for IC where IRQs are
optional, in a way that when IRQs are supported the pointer to handle
is valid, when IRQs aren't supported the pointer is NULL. (Why) do you
think we should skip the check?

>  
> > +               struct regulator_irq *h = *handle;
> > +
> > +               free_irq(h->irq, h);
> > +               if (h->desc.irq_off_ms)
> > +                       cancel_delayed_work_sync(&h->isr_work);
> > +
> > +               h = NULL;
> > +       }
> > +}
> > +EXPORT_SYMBOL_GPL(regulator_irq_helper_cancel);
> > +
> > +static void regulator_irq_helper_drop(struct device *dev, void
> > *res)
> > +{
> > +       regulator_irq_helper_cancel(res);
> > +}
> > +
> > +void *devm_regulator_irq_helper(struct device *dev,
> > +                                const struct regulator_irq_desc
> > *d, int irq,
> > +                                int irq_flags, int common_errs,
> > +                                int *per_rdev_errs,
> > +                                struct regulator_dev **rdev, int
> > rdev_amount)
> > +{
> > +       void **ptr;
> > +
> > +       ptr = devres_alloc(regulator_irq_helper_drop, sizeof(*ptr),
> > GFP_KERNEL);
> > +       if (!ptr)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       *ptr = regulator_irq_helper(dev, d, irq, irq_flags,
> > common_errs,
> > +                                   per_rdev_errs, rdev,
> > rdev_amount);
> > +
> > +       if (IS_ERR(*ptr))
> > +               devres_free(ptr);
> > +       else
> > +               devres_add(dev, ptr);
> > +
> > +       return *ptr;
> 
> Why not to use devm_add_action{_or_reset}()?

I just followed the same approach that has been used in other regulator
functions. (drivers/regulator/devres.c)
OTOH, the devm_add_action makes this little bit simpler so I'll convert
to use it.
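
For reference, the "add or reset" behaviour can be modelled in user space as
below. This is only a sketch of the semantics as I understand them (queue a
cleanup action, and if queuing fails run the action immediately and return
the error). struct fake_dev, add_action_or_reset(), release_all() and
cancel_helper() are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>

#define MAX_ACTIONS 4

/* Hypothetical device holding a fixed-size list of cleanup actions. */
struct fake_dev {
	void (*action[MAX_ACTIONS])(void *);
	void *data[MAX_ACTIONS];
	int count;
};

/* Queue a cleanup action; on failure run it right away and report why. */
static int add_action_or_reset(struct fake_dev *dev,
			       void (*action)(void *), void *data)
{
	assert(dev && action);
	if (dev->count == MAX_ACTIONS) {
		action(data);	/* "reset": undo what the caller just set up */
		return -ENOMEM;
	}
	dev->action[dev->count] = action;
	dev->data[dev->count] = data;
	dev->count++;
	return 0;
}

/* Device goes away: run queued actions in reverse registration order. */
static void release_all(struct fake_dev *dev)
{
	assert(dev);
	while (dev->count > 0) {
		dev->count--;
		dev->action[dev->count](dev->data[dev->count]);
	}
}

/* Example action: mark a hypothetical helper handle as cancelled. */
static void cancel_helper(void *data)
{
	*(int *)data = 1;
}
```

The caller then needs only one error check after registering the action, as
the failure path has already cleaned up.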

Mark, do you have a reason of not using devm_add_action() in devres.c?
Should devm_add_action() be used in some other functions there? And
should this be moved to devres.c?

Best Regards
	Matti Vaittinen
Andy Shevchenko April 7, 2021, 9:10 a.m. UTC | #2
On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen
<matti.vaittinen@fi.rohmeurope.com> wrote:
>
> Morning Andy,
>
> Thanks for the review! By the way, is it me or did your mail-client
> spill this out using HTML?

It's Gmail from my mobile phone, sorry for that. We have to blame
Google that they don't think through.

> On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> > On Tuesday, April 6, 2021, Matti Vaittinen <
> > matti.vaittinen@fi.rohmeurope.com> wrote:

...

> > > +       pr_emerg(msg);
> >
> > Oh là là, besides build bot complaints, this has serious security
> > implications. Never do like this.
>
> I'm not even trying to claim that was correct. And I did send a fixup -
> sorry for this. I don't intend to do this again.
>
> Now, when this is said - If you have a minute, please educate me.
> Assuming we know all the callers and that all the callers use this as
>
> die_loudly("foobarfoo\n");
> - what is the exploit mechanism?

Not a security guy, but my understanding is that this code may be used
as a gadget in ROP technique of attacks.
In that case msg can be anything. On top of that, somebody may
mistakenly (inadvertently) put the code that allows user controller
input to go to this path.

And last but not least, that some newbies might copy'n'paste bad
examples where they will expose security breach.

With the modern world of Spectre, rowhammer, and other side channel
attacks I may believe that one may exhaust the regulator for getting
advantage on an attack vector.

But again, not a security guy here.

> > > +       BUG();
> > > +}
> > > +

...

> > > errors will be
> > > + *                     or'ed with common erros. If this is given

errors ?

...

> > > +       if (irq <= 0) {
> > > +               dev_err(dev, "No IRQ\n");
> > > +               return ERR_PTR(-EINVAL);
> >
> > Why shadowing error code? Negative IRQ is anything but “no IRQ”.
>
> This was a good point. The irq is passed here as parameter. From this
> function's perspective the negative irq is invalid parameter - we don't
> know how the caller has obtained it. Print could show the value
> contained in irq though.

> Now that you pointed this out I am unsure if this check is needed here.
> If we check it, then I still think we should report -EINVAL for invalid
> parameter. Other option is to just call the request_threaded_irq() -
> log the IRQ request failure and return what request_threaded_irq()
> returns. Do you think that would make sense?

Why is the parameter signed type then?
Shouldn't the caller take care of it?

Otherwise, what is the difference between passing negative IRQ to
request_irq() call?
As you said, you shouldn't make assumptions about what caller meant by this.

So, I would simply drop the check (from easiness of the code perspective).

...

> > > +void regulator_irq_helper_cancel(void **handle)
> > > +{
> > > +       if (handle && *handle) {
> >
> > Can handle ever be NULL here ? (Yes, I understand that you export
> > this)
>
> To tell the truth - I am not sure. I *guess* that if we allow this to
> be NULL, then one *could* implement a driver for IC where IRQs are
> optional, in a way that when IRQs are supported the pointer to handle
> is valid, when IRQs aren't supported the pointer is NULL. (Why) do you
> think we should skip the check?

Just my guts feeling. I don't remember that I ever saw checks like
this for indirect pointers.
Of course it doesn't mean there are no such checks present or may be present.

...

> > Why not to use devm_add_action{_or_reset}()?
>
> I just followed the same approach that has been used in other regulator
> functions. (drivers/regulator/devres.c)
> OTOH, the devm_add_action makes this little bit simpler so I'll convert
> to use it.
>
> Mark, do you have a reason of not using devm_add_action() in devres.c?
> Should devm_add_action() be used in some other functions there? And
> should this be moved to devres.c?

I think the reason for this is as simple as a historical one, i.e.
there was no such API that time.
Matti Vaittinen April 7, 2021, 9:49 a.m. UTC | #3
Hello Andy,

On Wed, 2021-04-07 at 12:10 +0300, Andy Shevchenko wrote:
> On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen
> <matti.vaittinen@fi.rohmeurope.com> wrote:
> > On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> > > On Tuesday, April 6, 2021, Matti Vaittinen <
> > > matti.vaittinen@fi.rohmeurope.com> wrote:
> 
> ...
> 
> > > > +       pr_emerg(msg);
> > > 
> > > Oh là là, besides build bot complaints, this has serious security
> > > implications. Never do like this.
> > 
> > I'm not even trying to claim that was correct. And I did send a
> > fixup -
> > sorry for this. I don't intend to do this again.
> > 
> > Now, when this is said - If you have a minute, please educate me.
> > Assuming we know all the callers and that all the callers use this
> > as
> > 
> > die_loudly("foobarfoo\n");
> > - what is the exploit mechanism?
> 
> Not a security guy, but my understanding is that this code may be
> used
> as a gadget in ROP technique of attacks.

Thanks Andy. It'd be interesting to learn more details as I am not a
security expert either :)

> In that case msg can be anything. On top of that, somebody may
> mistakenly (inadvertently) put the code that allows user controller
> input to go to this path.

Yes. This is a good reason to not to do this - but I was interested in
knowing if there is a potential risk even if:

> > all the callers use this
> > as
> > 
> > die_loudly("foobarfoo\n");


> And last but not least, that some newbies might copy'n'paste bad
> examples where they will expose security breach.

Yes yes. As I said, I am not trying to say it is Ok. I was just
wondering what are the risks if users of the print function were known.

> With the modern world of Spectre, rowhammer, and other side channel
> attacks I may believe that one may exhaust the regulator for getting
> advantage on an attack vector.
> 
> But again, not a security guy here.

Thanks anyways :)

> 
> > > > +       BUG();
> > > > +}
> > > > +
> 
> ...
> 
> > > > errors will be
> > > > + *                     or'ed with common erros. If this is
> > > > given
> 
> errors ?

Thanks. I didn't first spot the typo even though you pointed it to me.
Luckily my evolution has occasional problems when communicating with
the mail server. I had enough time to hit the cancel before sending out
a message where I wondered how I should clarify this :]

> ...
> 
> > > > +       if (irq <= 0) {
> > > > +               dev_err(dev, "No IRQ\n");
> > > > +               return ERR_PTR(-EINVAL);
> > > 
> > > Why shadowing error code? Negative IRQ is anything but “no IRQ”.
> > 
> > This was a good point. The irq is passed here as parameter. From
> > this
> > function's perspective the negative irq is invalid parameter - we
> > don't
> > know how the caller has obtained it. Print could show the value
> > contained in irq though.
> > Now that you pointed this out I am unsure if this check is needed
> > here.
> > If we check it, then I still think we should report -EINVAL for
> > invalid
> > parameter. Other option is to just call the request_threaded_irq()
> > -
> > log the IRQ request failure and return what request_threaded_irq()
> > returns. Do you think that would make sense?
> 
> Why is the parameter signed type then?
> Shouldn't the caller take care of it?
> 
> Otherwise, what is the difference between passing negative IRQ to
> request_irq() call?
> As you said, you shouldn't make assumptions about what caller meant
> by this.
> 
> So, I would simply drop the check (from easiness of the code
> perspective).

Yep. I was going to drop the check. Good point. Thanks.
I'll send v6 shortly to address the issues you spotted Andy. Thanks.

> 
> ...
> 
> > > > +void regulator_irq_helper_cancel(void **handle)
> > > > +{
> > > > +       if (handle && *handle) {
> > > 
> > > Can handle ever be NULL here ? (Yes, I understand that you export
> > > this)
> > 
> > To tell the truth - I am not sure. I *guess* that if we allow this
> > to
> > be NULL, then one *could* implement a driver for IC where IRQs are
> > optional, in a way that when IRQs are supported the pointer to
> > handle
> > is valid, when IRQs aren't supported the pointer is NULL. (Why) do
> > you
> > think we should skip the check?
> 
> Just my guts feeling. I don't remember that I ever saw checks like
> this for indirect pointers.
> Of course it doesn't mean there are no such checks present or may be
> present.

I think I'll keep the check unless there is some reason why it should
be omitted.
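
A user-space sketch of a cancel function that tolerates both a NULL handle
pointer and an already cleared handle. struct helper_sketch, helper_create()
and helper_cancel() are hypothetical names. Note that the quoted hunk assigns
NULL to the local 'h'; the sketch assumes the intent stated in the kernel-doc
("Will be NULLed upon return") and clears *handle instead.

```c
#include <assert.h>
#include <stdlib.h>

struct helper_sketch {
	int irq;
};

/* Hypothetical constructor standing in for the registration call. */
static void *helper_create(int irq)
{
	struct helper_sketch *h = malloc(sizeof(*h));

	if (h)
		h->irq = irq;
	return h;
}

/*
 * Accepts a NULL handle pointer and an already cleared handle, and clears
 * the caller's pointer so that a second call is a no-op.
 */
static void helper_cancel(void **handle)
{
	if (handle && *handle) {
		free(*handle);
		*handle = NULL;	/* write through the pointer, not to a copy */
	}
}
```

A driver with an optional IRQ could then call helper_cancel() unconditionally
on its remove path.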

> > > Why not to use devm_add_action{_or_reset}()?
> > 
> > I just followed the same approach that has been used in other
> > regulator
> > functions. (drivers/regulator/devres.c)
> > OTOH, the devm_add_action makes this little bit simpler so I'll
> > convert
> > to use it.
> > 
> > Mark, do you have a reason of not using devm_add_action() in
> > devres.c?
> > Should devm_add_action() be used in some other functions there? And
> > should this be moved to devres.c?
> 
> I think the reason for this is as simple as a historical one, i.e.
> there was no such API that time.

Right. This is probably the reason why they were written as they are. I
was just wondering if Mark had a reason to keep them that way - or if
he would appreciate it if one converted them to use the
devm_add_action() family of functions.

Best Regards
  Matti.
Andy Shevchenko April 7, 2021, 12:50 p.m. UTC | #4
On Wed, Apr 7, 2021 at 12:49 PM Vaittinen, Matti
<Matti.Vaittinen@fi.rohmeurope.com> wrote:
> On Wed, 2021-04-07 at 12:10 +0300, Andy Shevchenko wrote:
> > On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen
> > <matti.vaittinen@fi.rohmeurope.com> wrote:
> > > On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> > > > On Tuesday, April 6, 2021, Matti Vaittinen <
> > > > matti.vaittinen@fi.rohmeurope.com> wrote:

Kees, there are two non-security guys discussing potential security
matters. Perhaps you may shed a light on this and tell which of our
stuff is risky and which is not and your recommendations on it.

> > > > > +       pr_emerg(msg);
> > > >
> > > > Oh là là, besides build bot complaints, this has serious security
> > > > implications. Never do like this.
> > >
> > > I'm not even trying to claim that was correct. And I did send a
> > > fixup -
> > > sorry for this. I don't intend to do this again.
> > >
> > > Now, when this is said - If you have a minute, please educate me.
> > > Assuming we know all the callers and that all the callers use this
> > > as
> > >
> > > die_loudly("foobarfoo\n");
> > > - what is the exploit mechanism?
> >
> > Not a security guy, but my understanding is that this code may be
> > used
> > as a gadget in ROP technique of attacks.
>
> Thanks Andy. It'd be interesting to learn more details as I am not a
> security expert either :)
>
> > In that case msg can be anything. On top of that, somebody may
> > mistakenly (inadvertently) put the code that allows user controller
> > input to go to this path.
>
> Yes. This is a good reason to not to do this - but I was interested in
> knowing if there is a potential risk even if:
>
> > > all the callers use this
> > > as
> > >
> > > die_loudly("foobarfoo\n");

I don't see direct issues, only indirect ones, for example, if by some
reason the memory of this message appears writable. So, whoever
controls the format string of printf() controls a lot. That's why it's
preferable to spell out exact intentions in the explicit format
string.

> > And last but not least, that some newbies might copy'n'paste bad
> > examples where they will expose security breach.
>
> Yes yes. As I said, I am not trying to say it is Ok. I was just
> wondering what are the risks if users of the print function were known.
>
> > With the modern world of Spectre, rowhammer, and other side channel
> > attacks I may believe that one may exhaust the regulator for getting
> > advantage on an attack vector.
> >
> > But again, not a security guy here.
>
> Thanks anyways :)

> > > > > +       BUG();
> > > > > +}
Kees Cook April 9, 2021, 3:20 a.m. UTC | #5
On Wed, Apr 07, 2021 at 03:50:15PM +0300, Andy Shevchenko wrote:
> On Wed, Apr 7, 2021 at 12:49 PM Vaittinen, Matti
> <Matti.Vaittinen@fi.rohmeurope.com> wrote:
> > On Wed, 2021-04-07 at 12:10 +0300, Andy Shevchenko wrote:
> > > On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen
> > > <matti.vaittinen@fi.rohmeurope.com> wrote:
> > > > On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> > > > > On Tuesday, April 6, 2021, Matti Vaittinen <
> > > > > matti.vaittinen@fi.rohmeurope.com> wrote:
> 
> Kees, there are two non-security guys discussing potential security
> matters. Perhaps you may shed a light on this and tell which of our
> stuff is risky and which is not and your recommendations on it.

Hi!

> > > > > > +       pr_emerg(msg);
> > > > >
> > > > > Oh là là, besides build bot complaints, this has serious security
> > > > > implications. Never do like this.
> > > >
> > > > I'm not even trying to claim that was correct. And I did send a
> > > > fixup -
> > > > sorry for this. I don't intend to do this again.
> > > >
> > > > Now, when this is said - If you have a minute, please educate me.
> > > > Assuming we know all the callers and that all the callers use this
> > > > as
> > > >
> > > > die_loudly("foobarfoo\n");
> > > > - what is the exploit mechanism?

I may not be following the thread exactly, here, but normally the issue
is just one of robustness and code maintainability. You can't be sure all
future callers will always pass in a const string, so better to always do:

	pr_whatever("%s\n", string_var);

> > > Not a security guy, but my understanding is that this code may be
> > > used
> > > as a gadget in ROP technique of attacks.

The primary concern is with giving an attacker control over a format
string (which can be used to expose kernel memory). It used to be much
more serious when the kernel still implemented %n, which would turn such
things into a potential memory _overwrite_. We removed %n a long time
ago now. :)
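
A user-space illustration of the point, using snprintf() rather than the
kernel's printk family. format_fixed() is a hypothetical helper name. With a
fixed "%s" format, conversion specifiers inside the message are copied
verbatim instead of being interpreted; passing the message itself as the
format would instead make the formatter read arguments that were never
supplied.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format msg through a fixed "%s" format. Any '%' sequences inside msg
 * are treated as plain characters. Returns what snprintf() returns.
 */
static int format_fixed(char *buf, size_t len, const char *msg)
{
	assert(buf && msg);
	/* Unsafe variant, for contrast only: snprintf(buf, len, msg); */
	return snprintf(buf, len, "%s", msg);
}
```

Calling format_fixed() with a message such as "load at 100%x %s\n" leaves the
buffer holding exactly that text.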

> > Thanks Andy. It'd be interesting to learn more details as I am not a
> > security expert either :)
> >
> > > In that case msg can be anything. On top of that, somebody may
> > > mistakenly (inadvertently) put the code that allows user controller
> > > input to go to this path.
> >
> > Yes. This is a good reason to not to do this - but I was interested in
> > knowing if there is a potential risk even if:
> >
> > > > all the callers use this
> > > > as
> > > >
> > > > die_loudly("foobarfoo\n");
> 
> I don't see direct issues, only indirect ones, for example, if by some
> reason the memory of this message appears writable. So, whoever
> controls the format string of printf() controls a lot. That's why it's
> preferable to spell out exact intentions in the explicit format
> string.

Right.

> > > > > > +       BUG();
> > > > > > +}

This, though, are you sure you want to use BUG()? Linus gets upset about
such things:
https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
Matti Vaittinen April 9, 2021, 7:08 a.m. UTC | #6
On Thu, 2021-04-08 at 20:20 -0700, Kees Cook wrote:
> On Wed, Apr 07, 2021 at 03:50:15PM +0300, Andy Shevchenko wrote:
> > On Wed, Apr 7, 2021 at 12:49 PM Vaittinen, Matti
> > <Matti.Vaittinen@fi.rohmeurope.com> wrote:
> > > On Wed, 2021-04-07 at 12:10 +0300, Andy Shevchenko wrote:
> > > > On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen
> > > > <matti.vaittinen@fi.rohmeurope.com> wrote:
> > > > > On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> > > > > > On Tuesday, April 6, 2021, Matti Vaittinen <
> > > > > > matti.vaittinen@fi.rohmeurope.com> wrote:
> > > > > > > +       BUG();
> > > > > > > +}
> 
> This, though, are you sure you want to use BUG()? Linus gets upset
> about
> such things:
> https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
> 

I see. I am unsure of what would be the best action in the regulator
case we are handling here. To give the context, we assume here a
situation where power has gone out of regulation and the hardware is
probably failing. First countermeasure to protect what is left of HW is
to shut-down the failing regulator. BUG() was called here as a last
resort if shutting the power via regulator interface was not
implemented or working.

Eg, we try to take what ever last measure we can to minimize the HW
damage - and BUG() was used for this in the qcom driver where I stole
the idea. Judging the comment related to BUG() in asm-generic/bug.h

/*
 * Don't use BUG() or BUG_ON() unless there's really no way out; one
 * example might be detecting data structure corruption in the middle
 * of an operation that can't be backed out of.  If the (sub)system
 * can somehow continue operating, perhaps with reduced functionality,
 * it's probably not BUG-worthy.
 *
 * If you're tempted to BUG(), think again:  is completely giving up
 * really the *only* solution?  There are usually better options, where
 * users don't need to reboot ASAP and can mostly shut down cleanly.
 */
https://elixir.bootlin.com/linux/v5.12-rc6/source/include/asm-generic/bug.h#L55

this really might be valid use-case.

To me the real question is what happens after the BUG() - and if there
is any generic handling or if it is platform/board specific? Does it
actually have any chance to save the HW?

Mark already pointed that we might need to figure a way to punt a
"failing event" to the user-space to initiate better "safety shutdown".
Such event does not currently exist so I think the main use-case here
is to do logging and potentially prevent enabling any further actions
in the failing HW.

So - any better suggestions?

Best Regards
	Matti Vaittinen
Matti Vaittinen April 12, 2021, 12:24 p.m. UTC | #7
On Fri, 2021-04-09 at 10:08 +0300, Matti Vaittinen wrote:
> On Thu, 2021-04-08 at 20:20 -0700, Kees Cook wrote:
> > On Wed, Apr 07, 2021 at 03:50:15PM +0300, Andy Shevchenko wrote:
> > > On Wed, Apr 7, 2021 at 12:49 PM Vaittinen, Matti
> > > <Matti.Vaittinen@fi.rohmeurope.com> wrote:
> > > > On Wed, 2021-04-07 at 12:10 +0300, Andy Shevchenko wrote:
> > > > > On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen
> > > > > <matti.vaittinen@fi.rohmeurope.com> wrote:
> > > > > > On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote:
> > > > > > > On Tuesday, April 6, 2021, Matti Vaittinen <
> > > > > > > matti.vaittinen@fi.rohmeurope.com> wrote:
> > > > > > > > +       BUG();
> > > > > > > > +}
> > 
> > This, though, are you sure you want to use BUG()? Linus gets upset
> > about
> > such things:
> > https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
> > 
> 
> I see. I am unsure of what would be the best action in the regulator
> case we are handling here. To give the context, we assume here a
> situation where power has gone out of regulation and the hardware is
> probably failing. First countermeasure to protect what is left of HW
> is
> to shut-down the failing regulator. BUG() was called here as a last
> resort if shutting the power via regulator interface was not
> implemented or working.
> 
> Eg, we try to take what ever last measure we can to minimize the HW
> damage - and BUG() was used for this in the qcom driver where I stole
> the idea. Judging the comment related to BUG() in asm-generic/bug.h
> 
> /*
>  * Don't use BUG() or BUG_ON() unless there's really no way out; one
>  * example might be detecting data structure corruption in the middle
>  * of an operation that can't be backed out of.  If the (sub)system
>  * can somehow continue operating, perhaps with reduced functionality,
>  * it's probably not BUG-worthy.
>  *
>  * If you're tempted to BUG(), think again:  is completely giving up
>  * really the *only* solution?  There are usually better options, where
>  * users don't need to reboot ASAP and can mostly shut down cleanly.
>  */
> https://elixir.bootlin.com/linux/v5.12-rc6/source/include/asm-generic/bug.h#L55
> 
> this really might be valid use-case.
> 
> To me the real question is what happens after the BUG() - and if
> there
> is any generic handling or if it is platform/board specific? Does it
> actually have any chance to save the HW?
> 
> Mark already pointed that we might need to figure a way to punt a
> "failing event" to the user-space to initiate better "safety
> shutdown".
> Such event does not currently exist so I think the main use-case here
> is to do logging and potentially prevent enabling any further actions
> in the failing HW.
> 
> So - any better suggestions?
> 

Maybe we should take same approach as is taken in thermal_core? Quoting
the thermal documentation:

"On an event of critical trip temperature crossing. Thermal framework
allows the system to shutdown gracefully by calling orderly_poweroff().
In the event of a failure of orderly_poweroff() to shut down the system
we are in danger of keeping the system alive at undesirably high
temperatures. To mitigate this high risk scenario we program a work
queue to fire after a pre-determined number of seconds to start
an emergency shutdown of the device using the kernel_power_off()
function. In case kernel_power_off() fails then finally
emergency_restart() is called in the worst case."

Maybe this 'hardware protection, in-kernel, emergency HW saving
shutdown' - logic, should be pulled out of thermal_core.c (or at least
exported) for (other parts like) the regulators to use?

I don't like the idea relying in the user-space to be in shape it can
handle the situation. I may be mistaken but I think a quick action
might be required. Hence the in-kernel handling does not sound so bad
to me.
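
The escalation order from the quoted documentation could be modelled roughly
like this. It is a user-space sketch with hypothetical names
(escalate_shutdown() and the three stand-in step functions). It assumes each
step simply reports success or failure, whereas the real kernel functions do
not return on success and the graceful step is guarded by a timeout.

```c
#include <assert.h>
#include <stddef.h>

/* A shutdown step returns 0 if it managed to bring the system down. */
typedef int (*shutdown_step)(void);

/*
 * Try each step in order and stop at the first one that succeeds.
 * Returns the index of the successful step, or -1 if all of them failed.
 */
static int escalate_shutdown(const shutdown_step *steps, size_t n)
{
	assert(steps || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (steps[i] && steps[i]() == 0)
			return (int)i;
	}
	return -1;
}

/* Hypothetical stand-ins for the graceful, forced and last-resort steps. */
static int graceful_poweroff_times_out(void) { return -1; }
static int forced_poweroff_fails(void) { return -1; }
static int emergency_restart_works(void) { return 0; }
```

Passing the three stand-ins in that order ends at index 2, the last-resort
step, because the first two report failure.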

I am open to all education and suggestions. Meanwhile I am planning to
just convert the BUG() to WARN(). I don't claim I know how BUG() is
implemented on each platform - but my understanding is that it does not
guarantee any power to be cut but just halts the calling process(?). I
guess this does not guarantee what happens next - maybe it even keeps
the power enabled and end up just deadlocking the system by reserved
locks? I think thermal guys have been pondering this scenario for
severe temperature protection shutdown so I would like to hear your
opinions.


Best Regards
Matti Vaittinen
Mark Brown April 12, 2021, 1:09 p.m. UTC | #8
On Mon, Apr 12, 2021 at 03:24:16PM +0300, Matti Vaittinen wrote:

> Maybe this 'hardware protection, in-kernel, emergency HW saving
> shutdown' - logic, should be pulled out of thermal_core.c (or at least
> exported) for (other parts like) the regulators to use?

That sounds sensible.