mbox series

[linux,dev-5.8,0/3] MAX31785 Fan Controller Work-arounds

Message ID 20200827133002.369439-1-andrew@aj.id.au
Headers show
Series MAX31785 Fan Controller Work-arounds | expand

Message

Andrew Jeffery Aug. 27, 2020, 1:29 p.m. UTC
Hello,

This series works around reliability problems with the MAX31785 fan controller
as observed in the field on some POWER systems.

I'm the first to admit the patches are not elegant, so feedback there is
appreciated.

Separately, our previous workarounds have run aground upstream as the hwmon
maintainer was unable to reproduce our observations on the MAX31785 evaluation
kit. I've recently received an evaluation kit, so I plan on putting some of
these issues to the test myself. Ultimately this will help determine whether we
have issues with our fan card designs or whether the controller itself is at
fault (I have to admit, given some of the failures, it's hard to see how the
controller might not be at fault). Basically, this paragraph is my excuse for
not pushing these patches further upstream for the moment; I will re-evaluate
once I have the results from testing against the evkit.

In the mean time, these patches resolve issues we've seen in some system
deployments. Taken together, I've put the driver through an unbind/bind loop
of over 20,000 iterations with no "loss" of fans, where prior to the series we
typically achieved only a few hundred. This feels like a significant
improvement, so please consider merging despite their ugliness.

Cheers,

Andrew

Andrew Jeffery (3):
  pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG
  pmbus: (max31785) Add a local pmbus_set_page() implementation
  pmbus: (core) Add a one-shot retry in pmbus_set_page()

 drivers/hwmon/pmbus/max31785.c   | 55 ++++++++++++++++++++++++++------
 drivers/hwmon/pmbus/pmbus_core.c | 33 ++++++++++++-------
 2 files changed, 66 insertions(+), 22 deletions(-)

Comments

Joel Stanley Sept. 1, 2020, 6:21 a.m. UTC | #1
On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote:
>
> Hello,
>
> This series works around reliability problems with the MAX31785 fan controller
> as observed in the field on some POWER systems.
>
> I'm the first to admit the patches are not elegant, so feedback there is
> appreciated.

The way you implemented the loop took me several goes to grok. I
finally got there.


>
> Separately, our previous workarounds have run aground upstream as the hwmon
> maintainer was unable to reproduce our observations on the MAX31785 evaluation
> kit. I've recently received an evaluation kit, so I plan on putting some of
> these issues to the test myself. Ultimately this will help determine whether we
> have issues with our fan card designs or whether the controller itself is at
> fault (I have to admit, given some of the failures, it's hard to see how the
> controller might not be at fault). Basically, this paragraph is my excuse for
> not pushing these patches further upstream for the moment; I will re-evaluate
> once I have the results from testing against the evkit.

I would post this series upstream so Guneter has some context for
future patches that come out of your investigation.

>
> In the mean time, these patches resolve issues we've seen in some system
> deployments. Taken together, I've put the driver through an unbind/bind loop
> of over 20,000 iterations with no "loss" of fans, where prior to the series we
> typically achieved only a few hundred. This feels like a significant
> improvement, so please consider merging despite their ugliness.

Do you want these in dev-5.4, or both 5.4 and 5.8?

>
> Cheers,
>
> Andrew
>
> Andrew Jeffery (3):
>   pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG
>   pmbus: (max31785) Add a local pmbus_set_page() implementation
>   pmbus: (core) Add a one-shot retry in pmbus_set_page()
>
>  drivers/hwmon/pmbus/max31785.c   | 55 ++++++++++++++++++++++++++------
>  drivers/hwmon/pmbus/pmbus_core.c | 33 ++++++++++++-------
>  2 files changed, 66 insertions(+), 22 deletions(-)
>
> --
> 2.25.1
>
Andrew Jeffery Oct. 26, 2020, 12:45 a.m. UTC | #2
On Tue, 1 Sep 2020, at 15:51, Joel Stanley wrote:
> On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote:
> >
> > Hello,
> >
> > This series works around reliability problems with the MAX31785 fan controller
> > as observed in the field on some POWER systems.
> >
> > I'm the first to admit the patches are not elegant, so feedback there is
> > appreciated.
> 
> The way you implemented the loop took me several goes to grok. I
> finally got there.
> 
> 
> >
> > Separately, our previous workarounds have run aground upstream as the hwmon
> > maintainer was unable to reproduce our observations on the MAX31785 evaluation
> > kit. I've recently received an evaluation kit, so I plan on putting some of
> > these issues to the test myself. Ultimately this will help determine whether we
> > have issues with our fan card designs or whether the controller itself is at
> > fault (I have to admit, given some of the failures, it's hard to see how the
> > controller might not be at fault). Basically, this paragraph is my excuse for
> > not pushing these patches further upstream for the moment; I will re-evaluate
> > once I have the results from testing against the evkit.
> 
> I would post this series upstream so Guneter has some context for
> future patches that come out of your investigation.
> 
> >
> > In the mean time, these patches resolve issues we've seen in some system
> > deployments. Taken together, I've put the driver through an unbind/bind loop
> > of over 20,000 iterations with no "loss" of fans, where prior to the series we
> > typically achieved only a few hundred. This feels like a significant
> > improvement, so please consider merging despite their ugliness.
> 
> Do you want these in dev-5.4, or both 5.4 and 5.8?

A bit late to respond, but can we apply these to 5.8 as well? I've experimented
with applying the same workaround as we did for the UCD90320, but it looks
like a bad idea: We're not actually monitoring the UCD90320 sensors at runtime
and so this hides the scheduling latency/jank that the I2C throttling patches
introduce. This jank is exposed by the fact that unlike the UCD90320 we do
monitor the MAX31785 sensors. Decreasing the jank by increasing the upper
bound of usleep_range() to e.g. a full tick leads to unacceptable latencies when
reading the MAX31785 sensor data (one read blows out to ~2.8 seconds due
to the caching strategy implemented in pmbus_core.c).

Cheers,

Andrew