mbox: Harden against BMC daemon errors

Message ID	20180322033235.8940-1-cyril.bur@au1.ibm.com
State	Superseded
Headers	show Return-Path: <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <skiboot@lists.ozlabs.org> from <cyril.bur@au1.ibm.com>; Thu, 22 Mar 2018 03:32:54 -0000 Gateway: Authorized Use Only! Violators will be prosecuted; Thu, 22 Mar 2018 03:32:52 -0000 From: Cyril Bur <cyril.bur@au1.ibm.com> To: skiboot@lists.ozlabs.org Date: Thu, 22 Mar 2018 14:32:35 +1100 Message-Id: <20180322033235.8940-1-cyril.bur@au1.ibm.com> Subject: [Skiboot] [PATCH] mbox: Harden against BMC daemon errors Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>
Series	mbox: Harden against BMC daemon errors \| expand mbox: Harden against BMC daemon errors

Cyril Bur March 22, 2018, 3:32 a.m. UTC

Bugs present in the BMC daemon mean that skiboot gets presented with
mbox windows of size zero. These windows cannot be valid and skiboot
already detects these conditions.

Currently skiboot warns quite strongly about the occurrence of these
problems. The problem for skiboot is that it doesn't take any action.
Initially I wanting to avoid putting policy like this into skiboot but
since these bugs aren't going away and skiboot barfing is leading to
lockups and ultimately the host going down something needs to be done.

I propose that when we detect the problem we fail the mbox call and punt
the problem back up to Linux. I don't like it but at least it will cause
errors to cascade and won't bring the host down. I'm not sure how Linux
is supposed to detect this or what it can even do but this is better
than a crash.

Diagnosing a failure to boot if skiboot its self fails to read flash may
be marginally more difficult with this patch. This is because skiboot
will now only print one warning about the zero sized window rather than
continuously spitting it out.

Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
---
 libflash/mbox-flash.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Nicholas Piggin March 22, 2018, 4:01 a.m. UTC | #1

Thanks!

On Thu, 22 Mar 2018 14:32:35 +1100
Cyril Bur <cyril.bur@au1.ibm.com> wrote:

> Bugs present in the BMC daemon mean that skiboot gets presented with
> mbox windows of size zero. These windows cannot be valid and skiboot
> already detects these conditions.
> 
> Currently skiboot warns quite strongly about the occurrence of these
> problems. The problem for skiboot is that it doesn't take any action.
> Initially I wanting to avoid putting policy like this into skiboot but
> since these bugs aren't going away and skiboot barfing is leading to
> lockups and ultimately the host going down something needs to be done.
> 
> I propose that when we detect the problem we fail the mbox call and punt
> the problem back up to Linux. I don't like it but at least it will cause
> errors to cascade and won't bring the host down. I'm not sure how Linux
> is supposed to detect this or what it can even do but this is better
> than a crash.
> 
> Diagnosing a failure to boot if skiboot its self fails to read flash may
> be marginally more difficult with this patch. This is because skiboot
> will now only print one warning about the zero sized window rather than
> continuously spitting it out.
> 
> Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> ---
>  libflash/mbox-flash.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> index 4a3c53f5..70f43f36 100644
> --- a/libflash/mbox-flash.c
> +++ b/libflash/mbox-flash.c
> @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
>  {
>  	unsigned long last = 1, start = tb_to_secs(mftb());
>  	prlog(PR_TRACE, "Waiting for BMC\n");
> -	while (mbox_flash->busy && timeout_sec) {
> +	while (mbox_flash->busy && timeout_sec > last) {
>  		long now = tb_to_secs(mftb());
>  		if (now - start > last) {
> -			timeout_sec--;
> -			last = now - start;
>  			if (last < timeout_sec / 2)
>  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
>  			else
>  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> +			last++;
>  		}
>  		/*
>  		 * Both functions are important.

This looks like another change got pulled in. Was this intentional?
It doesn't seem like it changes the logic.

While I'm looking at this code... wow, 30 seconds is a long time to
wait in firmware with interrupts disabled. 3 seconds is too. Anything
outside crashing, hardware failure handling, and genuine offline
operations (e.g., flashing firmware) and we should be able to count
the milliseconds in firmware.

If this is only for writing flash, well then a few seconds seems fine.
What's the minimum reasonable value? Something under 10 seconds would
be good so it doesn't trip the hard lockup watchdog. 3 seconds?

Is there anything more useful you can tell the administrator in the
error message? Should they ensure the BMC is running? Power cycle it?


> @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
>  		prlog(PR_ERR, "Move window is indicating size zero!\n");
>  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
>  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> +		/*
> +		 * In practice skiboot gets stuck and this eventually
> +		 * brings down the host. Just fail pass the error back
> +		 * up and hope someone makes a good decision
> +		 */
> +		return MBOX_R_SYSTEM_ERROR;
>  	}
>  
>  	return rc;

What call chains lead here? Will the callers handle the error properly?
(e.g., will this give something like EIO writing to /dev/nvram?)

Thanks,
Nick

Cyril Bur March 22, 2018, 4:31 a.m. UTC | #2

On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:
> Thanks!
> 
> On Thu, 22 Mar 2018 14:32:35 +1100
> Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> 
> > Bugs present in the BMC daemon mean that skiboot gets presented with
> > mbox windows of size zero. These windows cannot be valid and skiboot
> > already detects these conditions.
> > 
> > Currently skiboot warns quite strongly about the occurrence of these
> > problems. The problem for skiboot is that it doesn't take any action.
> > Initially I wanting to avoid putting policy like this into skiboot but
> > since these bugs aren't going away and skiboot barfing is leading to
> > lockups and ultimately the host going down something needs to be done.
> > 
> > I propose that when we detect the problem we fail the mbox call and punt
> > the problem back up to Linux. I don't like it but at least it will cause
> > errors to cascade and won't bring the host down. I'm not sure how Linux
> > is supposed to detect this or what it can even do but this is better
> > than a crash.
> > 
> > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > be marginally more difficult with this patch. This is because skiboot
> > will now only print one warning about the zero sized window rather than
> > continuously spitting it out.
> > 
> > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > ---
> >  libflash/mbox-flash.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > index 4a3c53f5..70f43f36 100644
> > --- a/libflash/mbox-flash.c
> > +++ b/libflash/mbox-flash.c
> > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> >  {
> >  	unsigned long last = 1, start = tb_to_secs(mftb());
> >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > -	while (mbox_flash->busy && timeout_sec) {
> > +	while (mbox_flash->busy && timeout_sec > last) {
> >  		long now = tb_to_secs(mftb());
> >  		if (now - start > last) {
> > -			timeout_sec--;
> > -			last = now - start;
> >  			if (last < timeout_sec / 2)
> >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> >  			else
> >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > +			last++;
> >  		}
> >  		/*
> >  		 * Both functions are important.
> 
> This looks like another change got pulled in. Was this intentional?
> It doesn't seem like it changes the logic.
> 

Yeah I messed with it a bit, thought I'd throw that in here, no change.
I was wanting to start some discussion around exactly what you're
saying so that's good.

> While I'm looking at this code... wow, 30 seconds is a long time to
> wait in firmware with interrupts disabled. 3 seconds is too. Anything
> outside crashing, hardware failure handling, and genuine offline
> operations (e.g., flashing firmware) and we should be able to count
> the milliseconds in firmware.
> 
> If this is only for writing flash, well then a few seconds seems fine.
> What's the minimum reasonable value? Something under 10 seconds would
> be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> 

Well as discussed before, all bets are off now since it seems the BMC
won't tell us of problems in any kind of timely manner. I could move
that hunk into a patch that just hardcodes 3 seconds everywhere (i'm
glad we put effort into putting it into the mbox protocol so that the
BMC wouldn't do it)

> Is there anything more useful you can tell the administrator in the
> error message? Should they ensure the BMC is running? Power cycle it?
> 

Could add a mention that they need to powercycle their BMC but if we've
gone that bad... who knows what state the host is in...

> 
> > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > +		/*
> > +		 * In practice skiboot gets stuck and this eventually
> > +		 * brings down the host. Just fail pass the error back
> > +		 * up and hope someone makes a good decision
> > +		 */
> > +		return MBOX_R_SYSTEM_ERROR;
> >  	}
> >  
> >  	return rc;
> 
> What call chains lead here? Will the callers handle the error properly?
> (e.g., will this give something like EIO writing to /dev/nvram?)
> 

/dev/nvram is a bit of a special case since skiboot buffers it. A
failing attempt to write it back would silently fail (as far as linux
is concerned).

/dev/mtd by the powernv_flash driver would give EIO (or some other
failure), pflash does get a failed read when trying to open the FFS
header from host userspace. As tested by Pridhiviraj.

Cyril

> Thanks,
> Nick
>

Nicholas Piggin March 22, 2018, 5:15 a.m. UTC | #3

On Thu, 22 Mar 2018 15:31:46 +1100
Cyril Bur <cyril.bur@au1.ibm.com> wrote:

> On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:
> > Thanks!
> > 
> > On Thu, 22 Mar 2018 14:32:35 +1100
> > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> >   
> > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > already detects these conditions.
> > > 
> > > Currently skiboot warns quite strongly about the occurrence of these
> > > problems. The problem for skiboot is that it doesn't take any action.
> > > Initially I wanting to avoid putting policy like this into skiboot but
> > > since these bugs aren't going away and skiboot barfing is leading to
> > > lockups and ultimately the host going down something needs to be done.
> > > 
> > > I propose that when we detect the problem we fail the mbox call and punt
> > > the problem back up to Linux. I don't like it but at least it will cause
> > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > is supposed to detect this or what it can even do but this is better
> > > than a crash.
> > > 
> > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > be marginally more difficult with this patch. This is because skiboot
> > > will now only print one warning about the zero sized window rather than
> > > continuously spitting it out.
> > > 
> > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > > ---
> > >  libflash/mbox-flash.c | 11 ++++++++---
> > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > index 4a3c53f5..70f43f36 100644
> > > --- a/libflash/mbox-flash.c
> > > +++ b/libflash/mbox-flash.c
> > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > >  {
> > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > -	while (mbox_flash->busy && timeout_sec) {
> > > +	while (mbox_flash->busy && timeout_sec > last) {
> > >  		long now = tb_to_secs(mftb());
> > >  		if (now - start > last) {
> > > -			timeout_sec--;
> > > -			last = now - start;
> > >  			if (last < timeout_sec / 2)
> > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > >  			else
> > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > +			last++;
> > >  		}
> > >  		/*
> > >  		 * Both functions are important.  
> > 
> > This looks like another change got pulled in. Was this intentional?
> > It doesn't seem like it changes the logic.
> >   
> 
> Yeah I messed with it a bit, thought I'd throw that in here, no change.
> I was wanting to start some discussion around exactly what you're
> saying so that's good.
> 
> > While I'm looking at this code... wow, 30 seconds is a long time to
> > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > outside crashing, hardware failure handling, and genuine offline
> > operations (e.g., flashing firmware) and we should be able to count
> > the milliseconds in firmware.
> > 
> > If this is only for writing flash, well then a few seconds seems fine.
> > What's the minimum reasonable value? Something under 10 seconds would
> > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> >   
> 
> Well as discussed before, all bets are off now since it seems the BMC
> won't tell us of problems in any kind of timely manner. I could move
> that hunk into a patch

If it could be split and this one fleshed out, that would be good.

> that just hardcodes 3 seconds everywhere (i'm
> glad we put effort into putting it into the mbox protocol so that the
> BMC wouldn't do it)

That would be good if we can, or even lower. What's the consequence
of timing out here in high level terms?  Will it brick your machine,
or just some tool just error out?

And what's the response time of the BMC look like in worst conditions?
Basically we want the point where waiting for longer will be worse
than failing. 

> 
> > Is there anything more useful you can tell the administrator in the
> > error message? Should they ensure the BMC is running? Power cycle it?
> >   
> 
> Could add a mention that they need to powercycle their BMC but if we've
> gone that bad...

If there is something concise and reasonable we can put in the log
to help the admin out, that's always good.

> who knows what state the host is in...
> 
> >   
> > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > +		/*
> > > +		 * In practice skiboot gets stuck and this eventually
> > > +		 * brings down the host. Just fail pass the error back
> > > +		 * up and hope someone makes a good decision
> > > +		 */
> > > +		return MBOX_R_SYSTEM_ERROR;
> > >  	}
> > >  
> > >  	return rc;  
> > 
> > What call chains lead here? Will the callers handle the error properly?
> > (e.g., will this give something like EIO writing to /dev/nvram?)
> >   
> 
> /dev/nvram is a bit of a special case since skiboot buffers it. A
> failing attempt to write it back would silently fail (as far as linux
> is concerned).
> 
> /dev/mtd by the powernv_flash driver would give EIO (or some other
> failure), pflash does get a failed read when trying to open the FFS
> header from host userspace. As tested by Pridhiviraj.

If there's a friendlier message you could add here too. I mean if it's
an expected condition with the firmware stack... Can you say

"Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."

Thanks,
Nick

Cyril Bur March 22, 2018, 5:54 a.m. UTC | #4

On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:
> On Thu, 22 Mar 2018 15:31:46 +1100
> Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> 
> > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:
> > > Thanks!
> > > 
> > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > >   
> > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > already detects these conditions.
> > > > 
> > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > lockups and ultimately the host going down something needs to be done.
> > > > 
> > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > is supposed to detect this or what it can even do but this is better
> > > > than a crash.
> > > > 
> > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > be marginally more difficult with this patch. This is because skiboot
> > > > will now only print one warning about the zero sized window rather than
> > > > continuously spitting it out.
> > > > 
> > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > > > ---
> > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > index 4a3c53f5..70f43f36 100644
> > > > --- a/libflash/mbox-flash.c
> > > > +++ b/libflash/mbox-flash.c
> > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > >  {
> > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > >  		long now = tb_to_secs(mftb());
> > > >  		if (now - start > last) {
> > > > -			timeout_sec--;
> > > > -			last = now - start;
> > > >  			if (last < timeout_sec / 2)
> > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > >  			else
> > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > +			last++;
> > > >  		}
> > > >  		/*
> > > >  		 * Both functions are important.  
> > > 
> > > This looks like another change got pulled in. Was this intentional?
> > > It doesn't seem like it changes the logic.
> > >   
> > 
> > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > I was wanting to start some discussion around exactly what you're
> > saying so that's good.
> > 
> > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > outside crashing, hardware failure handling, and genuine offline
> > > operations (e.g., flashing firmware) and we should be able to count
> > > the milliseconds in firmware.
> > > 
> > > If this is only for writing flash, well then a few seconds seems fine.
> > > What's the minimum reasonable value? Something under 10 seconds would
> > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > >   
> > 
> > Well as discussed before, all bets are off now since it seems the BMC
> > won't tell us of problems in any kind of timely manner. I could move
> > that hunk into a patch
> 
> If it could be split and this one fleshed out, that would be good.
> 
> > that just hardcodes 3 seconds everywhere (i'm
> > glad we put effort into putting it into the mbox protocol so that the
> > BMC wouldn't do it)
> 
> That would be good if we can, or even lower. What's the consequence
> of timing out here in high level terms?  Will it brick your machine,
> or just some tool just error out?
> 
> And what's the response time of the BMC look like in worst conditions?
> Basically we want the point where waiting for longer will be worse
> than failing. 
> 

Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
login prompt.

I'll leave that there.

Three seconds is really only going to save us if something loads the
BMC up at exactly the wrong time and it hickups (which does happen...).

I'll pull it out and send that patch because... what other choice do we
have?

> > 
> > > Is there anything more useful you can tell the administrator in the
> > > error message? Should they ensure the BMC is running? Power cycle it?
> > >   
> > 
> > Could add a mention that they need to powercycle their BMC but if we've
> > gone that bad...
> 
> If there is something concise and reasonable we can put in the log
> to help the admin out, that's always good.
> 
> > who knows what state the host is in...
> > 
> > >   
> > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > +		/*
> > > > +		 * In practice skiboot gets stuck and this eventually
> > > > +		 * brings down the host. Just fail pass the error back
> > > > +		 * up and hope someone makes a good decision
> > > > +		 */
> > > > +		return MBOX_R_SYSTEM_ERROR;
> > > >  	}
> > > >  
> > > >  	return rc;  
> > > 
> > > What call chains lead here? Will the callers handle the error properly?
> > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > >   
> > 
> > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > failing attempt to write it back would silently fail (as far as linux
> > is concerned).
> > 
> > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > failure), pflash does get a failed read when trying to open the FFS
> > header from host userspace. As tested by Pridhiviraj.
> 
> If there's a friendlier message you could add here too. I mean if it's
> an expected condition with the firmware stack... Can you say
> 
> "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> 

Where xxx fix your BMC mbox daemon... this really is a "should never
happen case".

The example I use all the time to illustrate this is: 
skiboot tells Linux there is 64mb of flash.
powernv_flash driver creates a mtd device (all good).
Newer ubuntu by default (I wouldn't but they're totally entitled to)
seems to probe that device for what I assume are filesystem special
blocks, sometimes the probe is at a block in the "flash" where the
daemon isn't expecting anyone to read because the flash layout header
indicates there's nothing there.
So it returns a "valid" (when it could have returned an invalid) window
of size zero.

The BMC daemon should just give a window of "empty" flash or, worse
case, say that the window is invalid - both make sense... both work,
both don't result in crashes.


> Thanks,
> Nick
>

Nicholas Piggin March 22, 2018, 6:02 a.m. UTC | #5

On Thu, 22 Mar 2018 16:54:06 +1100
Cyril Bur <cyril.bur@au1.ibm.com> wrote:

> On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:
> > On Thu, 22 Mar 2018 15:31:46 +1100
> > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> >   
> > > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:  
> > > > Thanks!
> > > > 
> > > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > > >     
> > > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > > already detects these conditions.
> > > > > 
> > > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > > lockups and ultimately the host going down something needs to be done.
> > > > > 
> > > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > > is supposed to detect this or what it can even do but this is better
> > > > > than a crash.
> > > > > 
> > > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > > be marginally more difficult with this patch. This is because skiboot
> > > > > will now only print one warning about the zero sized window rather than
> > > > > continuously spitting it out.
> > > > > 
> > > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > > > > ---
> > > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > > index 4a3c53f5..70f43f36 100644
> > > > > --- a/libflash/mbox-flash.c
> > > > > +++ b/libflash/mbox-flash.c
> > > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > > >  {
> > > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > > >  		long now = tb_to_secs(mftb());
> > > > >  		if (now - start > last) {
> > > > > -			timeout_sec--;
> > > > > -			last = now - start;
> > > > >  			if (last < timeout_sec / 2)
> > > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > > >  			else
> > > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > > +			last++;
> > > > >  		}
> > > > >  		/*
> > > > >  		 * Both functions are important.    
> > > > 
> > > > This looks like another change got pulled in. Was this intentional?
> > > > It doesn't seem like it changes the logic.
> > > >     
> > > 
> > > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > > I was wanting to start some discussion around exactly what you're
> > > saying so that's good.
> > >   
> > > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > > outside crashing, hardware failure handling, and genuine offline
> > > > operations (e.g., flashing firmware) and we should be able to count
> > > > the milliseconds in firmware.
> > > > 
> > > > If this is only for writing flash, well then a few seconds seems fine.
> > > > What's the minimum reasonable value? Something under 10 seconds would
> > > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > > >     
> > > 
> > > Well as discussed before, all bets are off now since it seems the BMC
> > > won't tell us of problems in any kind of timely manner. I could move
> > > that hunk into a patch  
> > 
> > If it could be split and this one fleshed out, that would be good.
> >   
> > > that just hardcodes 3 seconds everywhere (i'm
> > > glad we put effort into putting it into the mbox protocol so that the
> > > BMC wouldn't do it)  
> > 
> > That would be good if we can, or even lower. What's the consequence
> > of timing out here in high level terms?  Will it brick your machine,
> > or just some tool just error out?
> > 
> > And what's the response time of the BMC look like in worst conditions?
> > Basically we want the point where waiting for longer will be worse
> > than failing. 
> >   
> 
> Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
> login prompt.
> 
> I'll leave that there.
> 
> Three seconds is really only going to save us if something loads the
> BMC up at exactly the wrong time and it hickups (which does happen...).

Right. We can never cope with BMC reboot unless the OPAL API is
completely async, and the mbox protocol is sufficiently stateless
or recoverable so it will do something useful across reboots.

Which we should aim for *eventually*. I'll take 3 seconds for now
so the watchdog won't kill the host :) 

> 
> I'll pull it out and send that patch because... what other choice do we
> have?
> 
> > >   
> > > > Is there anything more useful you can tell the administrator in the
> > > > error message? Should they ensure the BMC is running? Power cycle it?
> > > >     
> > > 
> > > Could add a mention that they need to powercycle their BMC but if we've
> > > gone that bad...  
> > 
> > If there is something concise and reasonable we can put in the log
> > to help the admin out, that's always good.
> >   
> > > who knows what state the host is in...
> > >   
> > > >     
> > > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > > +		/*
> > > > > +		 * In practice skiboot gets stuck and this eventually
> > > > > +		 * brings down the host. Just fail pass the error back
> > > > > +		 * up and hope someone makes a good decision
> > > > > +		 */
> > > > > +		return MBOX_R_SYSTEM_ERROR;
> > > > >  	}
> > > > >  
> > > > >  	return rc;    
> > > > 
> > > > What call chains lead here? Will the callers handle the error properly?
> > > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > > >     
> > > 
> > > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > > failing attempt to write it back would silently fail (as far as linux
> > > is concerned).
> > > 
> > > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > > failure), pflash does get a failed read when trying to open the FFS
> > > header from host userspace. As tested by Pridhiviraj.  
> > 
> > If there's a friendlier message you could add here too. I mean if it's
> > an expected condition with the firmware stack... Can you say
> > 
> > "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> >   
> 
> Where xxx fix your BMC mbox daemon... this really is a "should never
> happen case".
> 
> The example I use all the time to illustrate this is: 
> skiboot tells Linux there is 64mb of flash.
> powernv_flash driver creates a mtd device (all good).
> Newer ubuntu by default (I wouldn't but they're totally entitled to)
> seems to probe that device for what I assume are filesystem special
> blocks, sometimes the probe is at a block in the "flash" where the
> daemon isn't expecting anyone to read because the flash layout header
> indicates there's nothing there.
> So it returns a "valid" (when it could have returned an invalid) window
> of size zero.
> 
> The BMC daemon should just give a window of "empty" flash or, worse
> case, say that the window is invalid - both make sense... both work,
> both don't result in crashes.

Okay, I think EIO from the /dev/ IO is pretty appropriate in that case.
Does there even need to even be a user visible message?

Thanks,
Nick

Cyril Bur March 22, 2018, 6:07 a.m. UTC | #6

On Thu, 2018-03-22 at 16:02 +1000, Nicholas Piggin wrote:
> On Thu, 22 Mar 2018 16:54:06 +1100
> Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> 
> > On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:
> > > On Thu, 22 Mar 2018 15:31:46 +1100
> > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > >   
> > > > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:  
> > > > > Thanks!
> > > > > 
> > > > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > > > >     
> > > > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > > > already detects these conditions.
> > > > > > 
> > > > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > > > lockups and ultimately the host going down something needs to be done.
> > > > > > 
> > > > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > > > is supposed to detect this or what it can even do but this is better
> > > > > > than a crash.
> > > > > > 
> > > > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > > > be marginally more difficult with this patch. This is because skiboot
> > > > > > will now only print one warning about the zero sized window rather than
> > > > > > continuously spitting it out.
> > > > > > 
> > > > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > > > > > ---
> > > > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > > > index 4a3c53f5..70f43f36 100644
> > > > > > --- a/libflash/mbox-flash.c
> > > > > > +++ b/libflash/mbox-flash.c
> > > > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > > > >  {
> > > > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > > > >  		long now = tb_to_secs(mftb());
> > > > > >  		if (now - start > last) {
> > > > > > -			timeout_sec--;
> > > > > > -			last = now - start;
> > > > > >  			if (last < timeout_sec / 2)
> > > > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > > > >  			else
> > > > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > > > +			last++;
> > > > > >  		}
> > > > > >  		/*
> > > > > >  		 * Both functions are important.    
> > > > > 
> > > > > This looks like another change got pulled in. Was this intentional?
> > > > > It doesn't seem like it changes the logic.
> > > > >     
> > > > 
> > > > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > > > I was wanting to start some discussion around exactly what you're
> > > > saying so that's good.
> > > >   
> > > > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > > > outside crashing, hardware failure handling, and genuine offline
> > > > > operations (e.g., flashing firmware) and we should be able to count
> > > > > the milliseconds in firmware.
> > > > > 
> > > > > If this is only for writing flash, well then a few seconds seems fine.
> > > > > What's the minimum reasonable value? Something under 10 seconds would
> > > > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > > > >     
> > > > 
> > > > Well as discussed before, all bets are off now since it seems the BMC
> > > > won't tell us of problems in any kind of timely manner. I could move
> > > > that hunk into a patch  
> > > 
> > > If it could be split and this one fleshed out, that would be good.
> > >   
> > > > that just hardcodes 3 seconds everywhere (i'm
> > > > glad we put effort into putting it into the mbox protocol so that the
> > > > BMC wouldn't do it)  
> > > 
> > > That would be good if we can, or even lower. What's the consequence
> > > of timing out here in high level terms?  Will it brick your machine,
> > > or just some tool just error out?
> > > 
> > > And what's the response time of the BMC look like in worst conditions?
> > > Basically we want the point where waiting for longer will be worse
> > > than failing. 
> > >   
> > 
> > Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
> > login prompt.
> > 
> > I'll leave that there.
> > 
> > Three seconds is really only going to save us if something loads the
> > BMC up at exactly the wrong time and it hickups (which does happen...).
> 
> Right. We can never cope with BMC reboot unless the OPAL API is
> completely async, and the mbox protocol is sufficiently stateless
> or recoverable so it will do something useful across reboots.
> 
> Which we should aim for *eventually*. I'll take 3 seconds for now
> so the watchdog won't kill the host :) 
> 

Ack

> > 
> > I'll pull it out and send that patch because... what other choice do we
> > have?
> > 
> > > >   
> > > > > Is there anything more useful you can tell the administrator in the
> > > > > error message? Should they ensure the BMC is running? Power cycle it?
> > > > >     
> > > > 
> > > > Could add a mention that they need to powercycle their BMC but if we've
> > > > gone that bad...  
> > > 
> > > If there is something concise and reasonable we can put in the log
> > > to help the admin out, that's always good.
> > >   
> > > > who knows what state the host is in...
> > > >   
> > > > >     
> > > > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > > > +		/*
> > > > > > +		 * In practice skiboot gets stuck and this eventually
> > > > > > +		 * brings down the host. Just fail pass the error back
> > > > > > +		 * up and hope someone makes a good decision
> > > > > > +		 */
> > > > > > +		return MBOX_R_SYSTEM_ERROR;
> > > > > >  	}
> > > > > >  
> > > > > >  	return rc;    
> > > > > 
> > > > > What call chains lead here? Will the callers handle the error properly?
> > > > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > > > >     
> > > > 
> > > > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > > > failing attempt to write it back would silently fail (as far as linux
> > > > is concerned).
> > > > 
> > > > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > > > failure), pflash does get a failed read when trying to open the FFS
> > > > header from host userspace. As tested by Pridhiviraj.  
> > > 
> > > If there's a friendlier message you could add here too. I mean if it's
> > > an expected condition with the firmware stack... Can you say
> > > 
> > > "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> > >   
> > 
> > Where xxx fix your BMC mbox daemon... this really is a "should never
> > happen case".
> > 
> > The example I use all the time to illustrate this is: 
> > skiboot tells Linux there is 64mb of flash.
> > powernv_flash driver creates a mtd device (all good).
> > Newer ubuntu by default (I wouldn't but they're totally entitled to)
> > seems to probe that device for what I assume are filesystem special
> > blocks, sometimes the probe is at a block in the "flash" where the
> > daemon isn't expecting anyone to read because the flash layout header
> > indicates there's nothing there.
> > So it returns a "valid" (when it could have returned an invalid) window
> > of size zero.
> > 
> > The BMC daemon should just give a window of "empty" flash or, worse
> > case, say that the window is invalid - both make sense... both work,
> > both don't result in crashes.
> 
> Okay, I think EIO from the /dev/ IO is pretty appropriate in that case.
> Does there even need to even be a user visible message?

The only reason there is a big barf about zero sized windows is that it
is hard to debug then skiboot either locks up (as it does today) or as
will happen now, read/writes start failing. I see your point and I
don't disagree that perhaps those could be debug level prints with a
single line "Please update your BMC!" (because I'm assuming these bugs
will get fixed) at error level or something.

> 
> Thanks,
> Nick
>

Nicholas Piggin March 22, 2018, 6:26 a.m. UTC | #7

On Thu, 22 Mar 2018 17:07:45 +1100
Cyril Bur <cyril.bur@au1.ibm.com> wrote:

> On Thu, 2018-03-22 at 16:02 +1000, Nicholas Piggin wrote:
> > On Thu, 22 Mar 2018 16:54:06 +1100
> > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> >   
> > > On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:  
> > > > On Thu, 22 Mar 2018 15:31:46 +1100
> > > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > > >     
> > > > > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:    
> > > > > > Thanks!
> > > > > > 
> > > > > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > > > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > > > > >       
> > > > > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > > > > already detects these conditions.
> > > > > > > 
> > > > > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > > > > lockups and ultimately the host going down something needs to be done.
> > > > > > > 
> > > > > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > > > > is supposed to detect this or what it can even do but this is better
> > > > > > > than a crash.
> > > > > > > 
> > > > > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > > > > be marginally more difficult with this patch. This is because skiboot
> > > > > > > will now only print one warning about the zero sized window rather than
> > > > > > > continuously spitting it out.
> > > > > > > 
> > > > > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > > > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > > > > > > ---
> > > > > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > > > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > > > > index 4a3c53f5..70f43f36 100644
> > > > > > > --- a/libflash/mbox-flash.c
> > > > > > > +++ b/libflash/mbox-flash.c
> > > > > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > > > > >  {
> > > > > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > > > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > > > > >  		long now = tb_to_secs(mftb());
> > > > > > >  		if (now - start > last) {
> > > > > > > -			timeout_sec--;
> > > > > > > -			last = now - start;
> > > > > > >  			if (last < timeout_sec / 2)
> > > > > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > > > > >  			else
> > > > > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > > > > +			last++;
> > > > > > >  		}
> > > > > > >  		/*
> > > > > > >  		 * Both functions are important.      
> > > > > > 
> > > > > > This looks like another change got pulled in. Was this intentional?
> > > > > > It doesn't seem like it changes the logic.
> > > > > >       
> > > > > 
> > > > > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > > > > I was wanting to start some discussion around exactly what you're
> > > > > saying so that's good.
> > > > >     
> > > > > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > > > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > > > > outside crashing, hardware failure handling, and genuine offline
> > > > > > operations (e.g., flashing firmware) and we should be able to count
> > > > > > the milliseconds in firmware.
> > > > > > 
> > > > > > If this is only for writing flash, well then a few seconds seems fine.
> > > > > > What's the minimum reasonable value? Something under 10 seconds would
> > > > > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > > > > >       
> > > > > 
> > > > > Well as discussed before, all bets are off now since it seems the BMC
> > > > > won't tell us of problems in any kind of timely manner. I could move
> > > > > that hunk into a patch    
> > > > 
> > > > If it could be split and this one fleshed out, that would be good.
> > > >     
> > > > > that just hardcodes 3 seconds everywhere (i'm
> > > > > glad we put effort into putting it into the mbox protocol so that the
> > > > > BMC wouldn't do it)    
> > > > 
> > > > That would be good if we can, or even lower. What's the consequence
> > > > of timing out here in high level terms?  Will it brick your machine,
> > > > or just some tool just error out?
> > > > 
> > > > And what's the response time of the BMC look like in worst conditions?
> > > > Basically we want the point where waiting for longer will be worse
> > > > than failing. 
> > > >     
> > > 
> > > Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
> > > login prompt.
> > > 
> > > I'll leave that there.
> > > 
> > > Three seconds is really only going to save us if something loads the
> > > BMC up at exactly the wrong time and it hickups (which does happen...).  
> > 
> > Right. We can never cope with BMC reboot unless the OPAL API is
> > completely async, and the mbox protocol is sufficiently stateless
> > or recoverable so it will do something useful across reboots.
> > 
> > Which we should aim for *eventually*. I'll take 3 seconds for now
> > so the watchdog won't kill the host :) 
> >   
> 
> Ack
> 
> > > 
> > > I'll pull it out and send that patch because... what other choice do we
> > > have?
> > >   
> > > > >     
> > > > > > Is there anything more useful you can tell the administrator in the
> > > > > > error message? Should they ensure the BMC is running? Power cycle it?
> > > > > >       
> > > > > 
> > > > > Could add a mention that they need to powercycle their BMC but if we've
> > > > > gone that bad...    
> > > > 
> > > > If there is something concise and reasonable we can put in the log
> > > > to help the admin out, that's always good.
> > > >     
> > > > > who knows what state the host is in...
> > > > >     
> > > > > >       
> > > > > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > > > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > > > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > > > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > > > > +		/*
> > > > > > > +		 * In practice skiboot gets stuck and this eventually
> > > > > > > +		 * brings down the host. Just fail pass the error back
> > > > > > > +		 * up and hope someone makes a good decision
> > > > > > > +		 */
> > > > > > > +		return MBOX_R_SYSTEM_ERROR;
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	return rc;      
> > > > > > 
> > > > > > What call chains lead here? Will the callers handle the error properly?
> > > > > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > > > > >       
> > > > > 
> > > > > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > > > > failing attempt to write it back would silently fail (as far as linux
> > > > > is concerned).
> > > > > 
> > > > > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > > > > failure), pflash does get a failed read when trying to open the FFS
> > > > > header from host userspace. As tested by Pridhiviraj.    
> > > > 
> > > > If there's a friendlier message you could add here too. I mean if it's
> > > > an expected condition with the firmware stack... Can you say
> > > > 
> > > > "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> > > >     
> > > 
> > > Where xxx fix your BMC mbox daemon... this really is a "should never
> > > happen case".
> > > 
> > > The example I use all the time to illustrate this is: 
> > > skiboot tells Linux there is 64mb of flash.
> > > powernv_flash driver creates a mtd device (all good).
> > > Newer ubuntu by default (I wouldn't but they're totally entitled to)
> > > seems to probe that device for what I assume are filesystem special
> > > blocks, sometimes the probe is at a block in the "flash" where the
> > > daemon isn't expecting anyone to read because the flash layout header
> > > indicates there's nothing there.
> > > So it returns a "valid" (when it could have returned an invalid) window
> > > of size zero.
> > > 
> > > The BMC daemon should just give a window of "empty" flash or, worse
> > > case, say that the window is invalid - both make sense... both work,
> > > both don't result in crashes.  
> > 
> > Okay, I think EIO from the /dev/ IO is pretty appropriate in that case.
> > Does there even need to even be a user visible message?  
> 
> The only reason there is a big barf about zero sized windows is that it
> is hard to debug then skiboot either locks up (as it does today) or as
> will happen now, read/writes start failing.

Oh they start permanently failing? Okay I thought it was just a once
off for the undefined regions. Yes then something should be printed.

> I see your point and I
> don't disagree that perhaps those could be debug level prints with a
> single line "Please update your BMC!" (because I'm assuming these bugs
> will get fixed) at error level or something.

Ah, I misunderstood then. I thought the current BMCs still had this
behaviour. Yes in this case, something like "BMC mbox flash writes
failing due to BMC returned zero sized window.\nPlease update your
BMC firmware!\n"

Thanks,
Nick

Cyril Bur March 22, 2018, 11:49 p.m. UTC | #8

On Thu, 2018-03-22 at 16:26 +1000, Nicholas Piggin wrote:
> On Thu, 22 Mar 2018 17:07:45 +1100
> Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> 
> > On Thu, 2018-03-22 at 16:02 +1000, Nicholas Piggin wrote:
> > > On Thu, 22 Mar 2018 16:54:06 +1100
> > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > >   
> > > > On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:  
> > > > > On Thu, 22 Mar 2018 15:31:46 +1100
> > > > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > > > >     
> > > > > > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:    
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > > > > > Cyril Bur <cyril.bur@au1.ibm.com> wrote:
> > > > > > >       
> > > > > > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > > > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > > > > > already detects these conditions.
> > > > > > > > 
> > > > > > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > > > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > > > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > > > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > > > > > lockups and ultimately the host going down something needs to be done.
> > > > > > > > 
> > > > > > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > > > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > > > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > > > > > is supposed to detect this or what it can even do but this is better
> > > > > > > > than a crash.
> > > > > > > > 
> > > > > > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > > > > > be marginally more difficult with this patch. This is because skiboot
> > > > > > > > will now only print one warning about the zero sized window rather than
> > > > > > > > continuously spitting it out.
> > > > > > > > 
> > > > > > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
> > > > > > > > Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
> > > > > > > > ---
> > > > > > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > > > > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > > > > > index 4a3c53f5..70f43f36 100644
> > > > > > > > --- a/libflash/mbox-flash.c
> > > > > > > > +++ b/libflash/mbox-flash.c
> > > > > > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > > > > > >  {
> > > > > > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > > > > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > > > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > > > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > > > > > >  		long now = tb_to_secs(mftb());
> > > > > > > >  		if (now - start > last) {
> > > > > > > > -			timeout_sec--;
> > > > > > > > -			last = now - start;
> > > > > > > >  			if (last < timeout_sec / 2)
> > > > > > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > > > > > >  			else
> > > > > > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > > > > > +			last++;
> > > > > > > >  		}
> > > > > > > >  		/*
> > > > > > > >  		 * Both functions are important.      
> > > > > > > 
> > > > > > > This looks like another change got pulled in. Was this intentional?
> > > > > > > It doesn't seem like it changes the logic.
> > > > > > >       
> > > > > > 
> > > > > > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > > > > > I was wanting to start some discussion around exactly what you're
> > > > > > saying so that's good.
> > > > > >     
> > > > > > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > > > > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > > > > > outside crashing, hardware failure handling, and genuine offline
> > > > > > > operations (e.g., flashing firmware) and we should be able to count
> > > > > > > the milliseconds in firmware.
> > > > > > > 
> > > > > > > If this is only for writing flash, well then a few seconds seems fine.
> > > > > > > What's the minimum reasonable value? Something under 10 seconds would
> > > > > > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > > > > > >       
> > > > > > 
> > > > > > Well as discussed before, all bets are off now since it seems the BMC
> > > > > > won't tell us of problems in any kind of timely manner. I could move
> > > > > > that hunk into a patch    
> > > > > 
> > > > > If it could be split and this one fleshed out, that would be good.
> > > > >     
> > > > > > that just hardcodes 3 seconds everywhere (i'm
> > > > > > glad we put effort into putting it into the mbox protocol so that the
> > > > > > BMC wouldn't do it)    
> > > > > 
> > > > > That would be good if we can, or even lower. What's the consequence
> > > > > of timing out here in high level terms?  Will it brick your machine,
> > > > > or just some tool just error out?
> > > > > 
> > > > > And what's the response time of the BMC look like in worst conditions?
> > > > > Basically we want the point where waiting for longer will be worse
> > > > > than failing. 
> > > > >     
> > > > 
> > > > Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
> > > > login prompt.
> > > > 
> > > > I'll leave that there.
> > > > 
> > > > Three seconds is really only going to save us if something loads the
> > > > BMC up at exactly the wrong time and it hickups (which does happen...).  
> > > 
> > > Right. We can never cope with BMC reboot unless the OPAL API is
> > > completely async, and the mbox protocol is sufficiently stateless
> > > or recoverable so it will do something useful across reboots.
> > > 
> > > Which we should aim for *eventually*. I'll take 3 seconds for now
> > > so the watchdog won't kill the host :) 
> > >   
> > 
> > Ack
> > 
> > > > 
> > > > I'll pull it out and send that patch because... what other choice do we
> > > > have?
> > > >   
> > > > > >     
> > > > > > > Is there anything more useful you can tell the administrator in the
> > > > > > > error message? Should they ensure the BMC is running? Power cycle it?
> > > > > > >       
> > > > > > 
> > > > > > Could add a mention that they need to powercycle their BMC but if we've
> > > > > > gone that bad...    
> > > > > 
> > > > > If there is something concise and reasonable we can put in the log
> > > > > to help the admin out, that's always good.
> > > > >     
> > > > > > who knows what state the host is in...
> > > > > >     
> > > > > > >       
> > > > > > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > > > > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > > > > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > > > > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > > > > > +		/*
> > > > > > > > +		 * In practice skiboot gets stuck and this eventually
> > > > > > > > +		 * brings down the host. Just fail pass the error back
> > > > > > > > +		 * up and hope someone makes a good decision
> > > > > > > > +		 */
> > > > > > > > +		return MBOX_R_SYSTEM_ERROR;
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > >  	return rc;      
> > > > > > > 
> > > > > > > What call chains lead here? Will the callers handle the error properly?
> > > > > > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > > > > > >       
> > > > > > 
> > > > > > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > > > > > failing attempt to write it back would silently fail (as far as linux
> > > > > > is concerned).
> > > > > > 
> > > > > > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > > > > > failure), pflash does get a failed read when trying to open the FFS
> > > > > > header from host userspace. As tested by Pridhiviraj.    
> > > > > 
> > > > > If there's a friendlier message you could add here too. I mean if it's
> > > > > an expected condition with the firmware stack... Can you say
> > > > > 
> > > > > "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> > > > >     
> > > > 
> > > > Where xxx fix your BMC mbox daemon... this really is a "should never
> > > > happen case".
> > > > 
> > > > The example I use all the time to illustrate this is: 
> > > > skiboot tells Linux there is 64mb of flash.
> > > > powernv_flash driver creates a mtd device (all good).
> > > > Newer ubuntu by default (I wouldn't but they're totally entitled to)
> > > > seems to probe that device for what I assume are filesystem special
> > > > blocks, sometimes the probe is at a block in the "flash" where the
> > > > daemon isn't expecting anyone to read because the flash layout header
> > > > indicates there's nothing there.
> > > > So it returns a "valid" (when it could have returned an invalid) window
> > > > of size zero.
> > > > 
> > > > The BMC daemon should just give a window of "empty" flash or, worse
> > > > case, say that the window is invalid - both make sense... both work,
> > > > both don't result in crashes.  
> > > 
> > > Okay, I think EIO from the /dev/ IO is pretty appropriate in that case.
> > > Does there even need to even be a user visible message?  
> > 
> > The only reason there is a big barf about zero sized windows is that it
> > is hard to debug then skiboot either locks up (as it does today) or as
> > will happen now, read/writes start failing.
> 
> Oh they start permanently failing? Okay I thought it was just a once
> off for the undefined regions. Yes then something should be printed.
> 

Yeah, I guess part of the reason I did it the way I initially did it is
that there was a vague hope that this kind of error would be transient
and therefore a retry is exactly what skiboot should do.

In practice, nope - we can pretty much assume that if we see zero sized
windows we're screwed.

> > I see your point and I
> > don't disagree that perhaps those could be debug level prints with a
> > single line "Please update your BMC!" (because I'm assuming these bugs
> > will get fixed) at error level or something.
> 
> Ah, I misunderstood then. I thought the current BMCs still had this
> behaviour. Yes in this case, something like "BMC mbox flash writes
> failing due to BMC returned zero sized window.\nPlease update your
> BMC firmware!\n"
> 
> Thanks,
> Nick
>

mbox: Harden against BMC daemon errors

Commit Message

Comments

Patch