Message ID | 20200827133002.369439-1-andrew@aj.id.au |
---|---|
Headers | show |
Series | MAX31785 Fan Controller Work-arounds | expand |
On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote: > > Hello, > > This series works around reliability problems with the MAX31785 fan controller > as observed in the field on some POWER systems. > > I'm the first to admit the patches are not elegant, so feedback there is > appreciated. The way you implemented the loop took me several goes to grok. I finally got there. > > Separately, our previous workarounds have run aground upstream as the hwmon > maintainer was unable to reproduce our observations on the MAX31785 evaluation > kit. I've recently received an evaluation kit, so I plan on putting some of > these issues to the test myself. Ultimately this will help determine whether we > have issues with our fan card designs or whether the controller itself is at > fault (I have to admit, given some of the failures, it's hard to see how the > controller might not be at fault). Basically, this paragraph is my excuse for > not pushing these patches further upstream for the moment; I will re-evaluate > once I have the results from testing against the evkit. I would post this series upstream so Guneter has some context for future patches that come out of your investigation. > > In the mean time, these patches resolve issues we've seen in some system > deployments. Taken together, I've put the driver through an unbind/bind loop > of over 20,000 iterations with no "loss" of fans, where prior to the series we > typically achieved only a few hundred. This feels like a significant > improvement, so please consider merging despite their ugliness. Do you want these in dev-5.4, or both 5.4 and 5.8? > > Cheers, > > Andrew > > Andrew Jeffery (3): > pmbus: (max31785) Retry enabling fans after writing MFR_FAN_CONFIG > pmbus: (max31785) Add a local pmbus_set_page() implementation > pmbus: (core) Add a one-shot retry in pmbus_set_page() > > drivers/hwmon/pmbus/max31785.c | 55 ++++++++++++++++++++++++++------ > drivers/hwmon/pmbus/pmbus_core.c | 33 ++++++++++++------- > 2 files changed, 66 insertions(+), 22 deletions(-) > > -- > 2.25.1 >
On Tue, 1 Sep 2020, at 15:51, Joel Stanley wrote: > On Thu, 27 Aug 2020 at 13:30, Andrew Jeffery <andrew@aj.id.au> wrote: > > > > Hello, > > > > This series works around reliability problems with the MAX31785 fan controller > > as observed in the field on some POWER systems. > > > > I'm the first to admit the patches are not elegant, so feedback there is > > appreciated. > > The way you implemented the loop took me several goes to grok. I > finally got there. > > > > > > Separately, our previous workarounds have run aground upstream as the hwmon > > maintainer was unable to reproduce our observations on the MAX31785 evaluation > > kit. I've recently received an evaluation kit, so I plan on putting some of > > these issues to the test myself. Ultimately this will help determine whether we > > have issues with our fan card designs or whether the controller itself is at > > fault (I have to admit, given some of the failures, it's hard to see how the > > controller might not be at fault). Basically, this paragraph is my excuse for > > not pushing these patches further upstream for the moment; I will re-evaluate > > once I have the results from testing against the evkit. > > I would post this series upstream so Guneter has some context for > future patches that come out of your investigation. > > > > > In the mean time, these patches resolve issues we've seen in some system > > deployments. Taken together, I've put the driver through an unbind/bind loop > > of over 20,000 iterations with no "loss" of fans, where prior to the series we > > typically achieved only a few hundred. This feels like a significant > > improvement, so please consider merging despite their ugliness. > > Do you want these in dev-5.4, or both 5.4 and 5.8? A bit late to respond, but can we apply these to 5.8 as well? I've experimented with applying the same workaround as we did for the UCD90320, but it looks like a bad idea: We're not actually monitoring the UCD90320 sensors at runtime and so this hides the scheduling latency/jank that the I2C throttling patches introduce. This jank is exposed by the fact that unlike the UCD90320 we do monitor the MAX31785 sensors. Decreasing the jank by increasing the upper bound of usleep_range() to e.g. a full tick leads to unacceptable latencies when reading the MAX31785 sensor data (one read blows out to ~2.8 seconds due to the caching strategy implemented in pmbus_core.c). Cheers, Andrew