
[v2,01/19] powerpc/rtas: handle extended delays safely in early boot

Message ID: 20230125-b4-powerpc-rtas-queue-v2-1-9aa6bd058063@linux.ibm.com (mailing list archive)
State: Superseded
Series: RTAS maintenance

Commit Message

Nathan Lynch via B4 Relay Feb. 6, 2023, 6:54 p.m. UTC
From: Nathan Lynch <nathanl@linux.ibm.com>

Some code that runs early in boot calls RTAS functions that can return
-2 or 990x statuses, which mean the caller should retry. An example is
pSeries_cmo_feature_init(), which invokes ibm,get-system-parameter but
treats these benign statuses as errors instead of retrying.

pSeries_cmo_feature_init() and similar code should be made to retry
until they succeed or receive a real error, using the usual pattern:

	do {
		rc = rtas_call(token, etc...);
	} while (rtas_busy_delay(rc));

But rtas_busy_delay() will perform a timed sleep on any 990x
status. This isn't safe so early in boot, before the CPU scheduler and
timer subsystem have initialized.
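
For reference, the statuses involved and the sleep a 990x hint implies
look roughly like this (a simplified sketch of the existing constants and
of rtas_busy_delay_time() behavior, not verbatim kernel source):

	/* Sketch of the relevant statuses (values as in asm/rtas.h): */
	#define RTAS_BUSY                -2	/* busy, retry after a short delay   */
	#define RTAS_EXTENDED_DELAY_MIN  9900	/* extended delay hints: 9900..9905  */
	#define RTAS_EXTENDED_DELAY_MAX  9905

	/*
	 * Approximately the delay rtas_busy_delay_time() derives from a
	 * status: 9900 => 1 ms, 9901 => 10 ms, ..., 9905 => 100,000 ms.
	 * rtas_busy_delay() then sleeps for that long, which requires
	 * working scheduling and timekeeping.
	 */
	static unsigned int suggested_delay_ms(int status)
	{
		unsigned int ms = 0;

		if (status == RTAS_BUSY) {
			ms = 1;
		} else if (status >= RTAS_EXTENDED_DELAY_MIN &&
			   status <= RTAS_EXTENDED_DELAY_MAX) {
			int order = status - RTAS_EXTENDED_DELAY_MIN;

			for (ms = 1; order > 0; order--)
				ms *= 10;
		}

		return ms;
	}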

The -2 RTAS status is much more likely to occur during single-threaded
boot than 990x in practice, at least on PowerVM. This is because -2
usually means that RTAS made progress but exhausted its self-imposed
timeslice, while 990x is associated with concurrent requests from the
OS causing internal contention. Regardless, according to the language
in PAPR, the OS should be prepared to handle either type of status at
any time.

Add a fallback path to rtas_busy_delay() to handle this as safely as
possible, performing a small delay on 990x. Include a counter to
detect retry loops that aren't making progress and bail out.
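
With the fallback in place, an early-boot caller such as
pSeries_cmo_feature_init() can then use the standard loop. Roughly
(illustrative sketch only; "parameter" stands in for whichever system
parameter token the caller needs):

	int token = rtas_token("ibm,get-system-parameter");
	int rc;

	do {
		spin_lock(&rtas_data_buf_lock);
		rc = rtas_call(token, 3, 1, NULL, parameter,
			       __pa(rtas_data_buf), RTAS_DATA_BUF_SIZE);
		/* on success, copy out of rtas_data_buf before unlocking */
		spin_unlock(&rtas_data_buf_lock);
	} while (rtas_busy_delay(rc));	/* now safe even before SYSTEM_SCHEDULING */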

This was found by inspection and I'm not aware of any real
failures. However, the implementation of rtas_busy_delay() before
commit 38f7b7067dae ("powerpc/rtas: rtas_busy_delay() improvements")
was not susceptible to this problem, so let's treat this as a
regression.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Fixes: 38f7b7067dae ("powerpc/rtas: rtas_busy_delay() improvements")
---
 arch/powerpc/kernel/rtas.c | 48 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Feb. 8, 2023, 11:20 a.m. UTC | #1
Nathan Lynch via B4 Submission Endpoint <devnull+nathanl.linux.ibm.com@kernel.org> writes:
> From: Nathan Lynch <nathanl@linux.ibm.com>
>
> [...]
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 795225d7f138..ec2df09a70cf 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -606,6 +606,46 @@ unsigned int rtas_busy_delay_time(int status)
>  	return ms;
>  }
>  
> +/*
> + * Early boot fallback for rtas_busy_delay().
> + */
> +static bool __init rtas_busy_delay_early(int status)
> +{
> +	static size_t successive_ext_delays __initdata;
> +	bool ret;

I think the logic would be easier to read if this was called "wait", but
maybe that's just me.

> +	switch (status) {
> +	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
> +		/*
> +		 * In the unlikely case that we receive an extended
> +		 * delay status in early boot, the OS is probably not
> +		 * the cause, and there's nothing we can do to clear
> +		 * the condition. Best we can do is delay for a bit
> +		 * and hope it's transient. Lie to the caller if it
> +		 * seems like we're stuck in a retry loop.
> +		 */
> +		mdelay(1);
> +		ret = true;
> +		successive_ext_delays += 1;
> +		if (successive_ext_delays > 1000) {
> +			pr_err("too many extended delays, giving up\n");
> +			dump_stack();
> +			ret = false;

Shouldn't we zero successive_ext_delays here?

Otherwise a subsequent (possibly different) RTAS call will immediately
fail out here if it gets a single extended delay from RTAS, won't it?

> +		}
> +		break;
> +	case RTAS_BUSY:
> +		ret = true;
> +		successive_ext_delays = 0;
> +		break;
> +	default:
> +		ret = false;
> +		successive_ext_delays = 0;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>  /**
>   * rtas_busy_delay() - helper for RTAS busy and extended delay statuses
>   *
> @@ -624,11 +664,17 @@ unsigned int rtas_busy_delay_time(int status)
>   * * false - @status is not @RTAS_BUSY nor an extended delay hint. The
>   *           caller is responsible for handling @status.
>   */
> -bool rtas_busy_delay(int status)
> +bool __ref rtas_busy_delay(int status)

Can you explain the __ref in the change log.

>  {
>  	unsigned int ms;
>  	bool ret;
>  
> +	/*
> +	 * Can't do timed sleeps before timekeeping is up.
> +	 */
> +	if (system_state < SYSTEM_SCHEDULING)
> +		return rtas_busy_delay_early(status);
> +
>  	switch (status) {
>  	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
>  		ret = true;
>

cheers
Nathan Lynch Feb. 8, 2023, 1:14 p.m. UTC | #2
Michael Ellerman <mpe@ellerman.id.au> writes:
> Nathan Lynch via B4 Submission Endpoint <devnull+nathanl.linux.ibm.com@kernel.org> writes:
>> From: Nathan Lynch <nathanl@linux.ibm.com>
>>
>> [...]
>>
>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> index 795225d7f138..ec2df09a70cf 100644
>> --- a/arch/powerpc/kernel/rtas.c
>> +++ b/arch/powerpc/kernel/rtas.c
>> @@ -606,6 +606,46 @@ unsigned int rtas_busy_delay_time(int status)
>>  	return ms;
>>  }
>>  
>> +/*
>> + * Early boot fallback for rtas_busy_delay().
>> + */
>> +static bool __init rtas_busy_delay_early(int status)
>> +{
>> +	static size_t successive_ext_delays __initdata;
>> +	bool ret;
>
> I think the logic would be easier to read if this was called "wait", but
> maybe that's just me.

Maybe "retry"? That communicates what the function is telling callers to do.

>
>> +	switch (status) {
>> +	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
>> +		/*
>> +		 * In the unlikely case that we receive an extended
>> +		 * delay status in early boot, the OS is probably not
>> +		 * the cause, and there's nothing we can do to clear
>> +		 * the condition. Best we can do is delay for a bit
>> +		 * and hope it's transient. Lie to the caller if it
>> +		 * seems like we're stuck in a retry loop.
>> +		 */
>> +		mdelay(1);
>> +		ret = true;
>> +		successive_ext_delays += 1;
>> +		if (successive_ext_delays > 1000) {
>> +			pr_err("too many extended delays, giving up\n");
>> +			dump_stack();
>> +			ret = false;
>
> Shouldn't we zero successive_ext_delays here?
>
> Otherwise a subsequent (possibly different) RTAS call will immediately
> fail out here if it gets a single extended delay from RTAS, won't it?

Yes, will fix. Thanks.
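
For illustration, probably something along these lines (sketch only, not
necessarily the exact change v3 will carry):

		if (successive_ext_delays > 1000) {
			pr_err("too many extended delays, giving up\n");
			dump_stack();
			ret = false;
			successive_ext_delays = 0;
		}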

>
>> +		}
>> +		break;
>> +	case RTAS_BUSY:
>> +		ret = true;
>> +		successive_ext_delays = 0;
>> +		break;
>> +	default:
>> +		ret = false;
>> +		successive_ext_delays = 0;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  /**
>>   * rtas_busy_delay() - helper for RTAS busy and extended delay statuses
>>   *
>> @@ -624,11 +664,17 @@ unsigned int rtas_busy_delay_time(int status)
>>   * * false - @status is not @RTAS_BUSY nor an extended delay hint. The
>>   *           caller is responsible for handling @status.
>>   */
>> -bool rtas_busy_delay(int status)
>> +bool __ref rtas_busy_delay(int status)
>
> Can you explain the __ref in the change log.

Yes, will add that.
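
For the record, what __ref buys us here, roughly: rtas_busy_delay() is not
__init, yet it now references __init/__initdata symbols, which would
normally trigger a modpost section-mismatch warning. __ref marks that
reference as intentional, and it is safe because the early path is only
taken while system_state < SYSTEM_SCHEDULING, well before init memory is
freed. A sketch of the shape involved:

	/* Placed in .init.text, discarded when init memory is freed. */
	static bool __init rtas_busy_delay_early(int status)
	{
		return false;	/* non-sleeping early handling elided */
	}

	/* Not __init, but allowed (via __ref) to reference the above. */
	bool __ref rtas_busy_delay(int status)
	{
		if (system_state < SYSTEM_SCHEDULING)	/* only before initmem is freed */
			return rtas_busy_delay_early(status);

		return false;	/* normal sleeping path elided in this sketch */
	}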


>>  {
>>  	unsigned int ms;
>>  	bool ret;
>>  
>> +	/*
>> +	 * Can't do timed sleeps before timekeeping is up.
>> +	 */
>> +	if (system_state < SYSTEM_SCHEDULING)
>> +		return rtas_busy_delay_early(status);
>> +
>>  	switch (status) {
>>  	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
>>  		ret = true;
>>
>
> cheers
Michael Ellerman Feb. 10, 2023, 5:54 a.m. UTC | #3
Nathan Lynch <nathanl@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Nathan Lynch via B4 Submission Endpoint <devnull+nathanl.linux.ibm.com@kernel.org> writes:
>>> From: Nathan Lynch <nathanl@linux.ibm.com>
>>>
>>> [...]
>>> +static bool __init rtas_busy_delay_early(int status)
>>> +{
>>> +	static size_t successive_ext_delays __initdata;
>>> +	bool ret;
>>
>> I think the logic would be easier to read if this was called "wait", but
>> maybe that's just me.
>
> Maybe "retry"? That communicates what the function is telling callers to do.

Yeah, that's even better.

cheers

Patch

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 795225d7f138..ec2df09a70cf 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -606,6 +606,46 @@  unsigned int rtas_busy_delay_time(int status)
 	return ms;
 }
 
+/*
+ * Early boot fallback for rtas_busy_delay().
+ */
+static bool __init rtas_busy_delay_early(int status)
+{
+	static size_t successive_ext_delays __initdata;
+	bool ret;
+
+	switch (status) {
+	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
+		/*
+		 * In the unlikely case that we receive an extended
+		 * delay status in early boot, the OS is probably not
+		 * the cause, and there's nothing we can do to clear
+		 * the condition. Best we can do is delay for a bit
+		 * and hope it's transient. Lie to the caller if it
+		 * seems like we're stuck in a retry loop.
+		 */
+		mdelay(1);
+		ret = true;
+		successive_ext_delays += 1;
+		if (successive_ext_delays > 1000) {
+			pr_err("too many extended delays, giving up\n");
+			dump_stack();
+			ret = false;
+		}
+		break;
+	case RTAS_BUSY:
+		ret = true;
+		successive_ext_delays = 0;
+		break;
+	default:
+		ret = false;
+		successive_ext_delays = 0;
+		break;
+	}
+
+	return ret;
+}
+
 /**
  * rtas_busy_delay() - helper for RTAS busy and extended delay statuses
  *
@@ -624,11 +664,17 @@  unsigned int rtas_busy_delay_time(int status)
  * * false - @status is not @RTAS_BUSY nor an extended delay hint. The
  *           caller is responsible for handling @status.
  */
-bool rtas_busy_delay(int status)
+bool __ref rtas_busy_delay(int status)
 {
 	unsigned int ms;
 	bool ret;
 
+	/*
+	 * Can't do timed sleeps before timekeeping is up.
+	 */
+	if (system_state < SYSTEM_SCHEDULING)
+		return rtas_busy_delay_early(status);
+
 	switch (status) {
 	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
 		ret = true;