diff mbox series

[8/8] powerpc/rtas: consume retry statuses in sys_rtas()

Message ID 20230220-rtas-queue-for-6-4-v1-8-010e4416f13f@linux.ibm.com (mailing list archive)
State Changes Requested
Headers show
Series RTAS changes for 6.4 | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 24 jobs.

Commit Message

Nathan Lynch via B4 Relay March 6, 2023, 9:33 p.m. UTC
From: Nathan Lynch <nathanl@linux.ibm.com>

The kernel can handle retrying RTAS function calls in response to
-2/990x in the sys_rtas() handler instead of relaying the intermediate
status to user space.

Justifications:

* Currently it's nondeterministic and quite variable in practice
  whether a retry status is returned for any given invocation of
  sys_rtas(). Therefore user space code cannot be expecting a retry
  result without already being broken.

* This tends to significantly reduce the total number of system calls
  issued by programs such as drmgr which make use of sys_rtas(),
  improving the experience of tracing and debugging such
  programs. This is the main motivation for me: I think this change
  will make it easier for us to characterize current sys_rtas() use
  cases as we move them to other interfaces over time.

* It reduces the number of opportunities for user space to leave
  complex operations, such as those associated with DLPAR, incomplete
  and diffcult to recover.

* We can expect performance improvements for existing sys_rtas()
  users, not only because of overall reduction in the number of system
  calls issued, but also due to the better handling of -2/990x in the
  kernel. For example, librtas still sleeps for 1ms on -2, which is
  completely unnecessary.

Performance differences for PHB add and remove on a small P10 PowerVM
partition are included below. For add, elapsed time is slightly
reduced. For remove, there are more significant improvements: the
number of context switches is reduced by an order of magnitude, and
elapsed time is reduced by over half.

(- before, + after):

  Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs):

-          1,847.58 msec task-clock                       #    0.135 CPUs utilized               ( +- 14.15% )
-            10,867      cs                               #    9.800 K/sec                       ( +- 14.14% )
+          1,901.15 msec task-clock                       #    0.148 CPUs utilized               ( +- 14.13% )
+            10,451      cs                               #    9.158 K/sec                       ( +- 14.14% )

-         13.656557 +- 0.000124 seconds time elapsed  ( +-  0.00% )
+          12.88080 +- 0.00404 seconds time elapsed  ( +-  0.03% )

  Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs):

-          1,473.75 msec task-clock                       #    0.092 CPUs utilized               ( +- 14.15% )
-             2,652      cs                               #    3.000 K/sec                       ( +- 14.16% )
+          1,444.55 msec task-clock                       #    0.221 CPUs utilized               ( +- 14.14% )
+               104      cs                               #  119.957 /sec                        ( +- 14.63% )

-          15.99718 +- 0.00801 seconds time elapsed  ( +-  0.05% )
+           6.54256 +- 0.00830 seconds time elapsed  ( +-  0.13% )

Move the existing rtas_lock-guarded critical section in sys_rtas()
into a conventional rtas_busy_delay()-based loop, returning to user
space only when a final success or failure result is available.

Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
---
 arch/powerpc/kernel/rtas.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

Comments

Andrew Donnellan March 23, 2023, 6:26 a.m. UTC | #1
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch <nathanl@linux.ibm.com>
> 
> The kernel can handle retrying RTAS function calls in response to
> -2/990x in the sys_rtas() handler instead of relaying the
> intermediate
> status to user space.
> 
> Justifications:
> 
> * Currently it's nondeterministic and quite variable in practice
>   whether a retry status is returned for any given invocation of
>   sys_rtas(). Therefore user space code cannot be expecting a retry
>   result without already being broken.
> 
> * This tends to significantly reduce the total number of system calls
>   issued by programs such as drmgr which make use of sys_rtas(),
>   improving the experience of tracing and debugging such
>   programs. This is the main motivation for me: I think this change
>   will make it easier for us to characterize current sys_rtas() use
>   cases as we move them to other interfaces over time.
> 
> * It reduces the number of opportunities for user space to leave
>   complex operations, such as those associated with DLPAR, incomplete
>   and diffcult to recover.
> 
> * We can expect performance improvements for existing sys_rtas()
>   users, not only because of overall reduction in the number of
> system
>   calls issued, but also due to the better handling of -2/990x in the
>   kernel. For example, librtas still sleeps for 1ms on -2, which is
>   completely unnecessary.

Would be good to see this fixed on the librtas side.

> 
> Performance differences for PHB add and remove on a small P10 PowerVM
> partition are included below. For add, elapsed time is slightly
> reduced. For remove, there are more significant improvements: the
> number of context switches is reduced by an order of magnitude, and
> elapsed time is reduced by over half.
> 
> (- before, + after):
> 
>   Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs):
> 
> -          1,847.58 msec task-clock                       #    0.135
> CPUs utilized               ( +- 14.15% )
> -            10,867      cs                               #    9.800
> K/sec                       ( +- 14.14% )
> +          1,901.15 msec task-clock                       #    0.148
> CPUs utilized               ( +- 14.13% )
> +            10,451      cs                               #    9.158
> K/sec                       ( +- 14.14% )
> 
> -         13.656557 +- 0.000124 seconds time elapsed  ( +-  0.00% )
> +          12.88080 +- 0.00404 seconds time elapsed  ( +-  0.03% )
> 
>   Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs):
> 
> -          1,473.75 msec task-clock                       #    0.092
> CPUs utilized               ( +- 14.15% )
> -             2,652      cs                               #    3.000
> K/sec                       ( +- 14.16% )
> +          1,444.55 msec task-clock                       #    0.221
> CPUs utilized               ( +- 14.14% )
> +               104      cs                               #  119.957
> /sec                        ( +- 14.63% )
> 
> -          15.99718 +- 0.00801 seconds time elapsed  ( +-  0.05% )
> +           6.54256 +- 0.00830 seconds time elapsed  ( +-  0.13% )
> 
> Move the existing rtas_lock-guarded critical section in sys_rtas()
> into a conventional rtas_busy_delay()-based loop, returning to user
> space only when a final success or failure result is available.
> 
> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>

Should there be some kind of timeout? I'm a bit worried by sleeping in
a syscall for an extended period.
Michael Ellerman March 23, 2023, 9:44 a.m. UTC | #2
Nathan Lynch via B4 Relay <devnull+nathanl.linux.ibm.com@kernel.org> writes:
> From: Nathan Lynch <nathanl@linux.ibm.com>
>
> The kernel can handle retrying RTAS function calls in response to
> -2/990x in the sys_rtas() handler instead of relaying the intermediate
> status to user space.

This looks good in general.

One query ...

> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 47a2aa43d7d4..c330a22ccc70 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs,
>  /* We assume to be passed big endian arguments */
>  SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>  {
> -	struct pin_cookie cookie;
>  	struct rtas_args args;
>  	unsigned long flags;
>  	char *buff_copy, *errbuf = NULL;
> @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>  
>  	buff_copy = get_errorlog_buffer();
>  
> -	raw_spin_lock_irqsave(&rtas_lock, flags);
> -	cookie = lockdep_pin_lock(&rtas_lock);
> +	do {
> +		struct pin_cookie cookie;
>  
> -	rtas_args = args;
> -	do_enter_rtas(&rtas_args);
> -	args = rtas_args;
> +		raw_spin_lock_irqsave(&rtas_lock, flags);
> +		cookie = lockdep_pin_lock(&rtas_lock);
>  
> -	/* A -1 return code indicates that the last command couldn't
> -	   be completed due to a hardware error. */
> -	if (be32_to_cpu(args.rets[0]) == -1)
> -		errbuf = __fetch_rtas_last_error(buff_copy);
> +		rtas_args = args;
> +		do_enter_rtas(&rtas_args);
> +		args = rtas_args;
>  
> -	lockdep_unpin_lock(&rtas_lock, cookie);
> -	raw_spin_unlock_irqrestore(&rtas_lock, flags);
> +		/*
> +		 * Handle error record retrieval before releasing the lock.
> +		 */
> +		if (be32_to_cpu(args.rets[0]) == -1)
> +			errbuf = __fetch_rtas_last_error(buff_copy);
> +
> +		lockdep_unpin_lock(&rtas_lock, cookie);
> +		raw_spin_unlock_irqrestore(&rtas_lock, flags);
> +	} while (rtas_busy_delay(be32_to_cpu(args.rets[0])));

rtas_busy_delay_early() has the successive_ext_delays case that will
break out eventually. But if we keep getting plain RTAS_BUSY back from
RTAS I *think* this loop will never terminate?

To avoid that, and just as good manners, I think we should have a
fatal_signal_pending() check, and if that returns true we bail out of
the syscall with -EINTR ?

cheers
Nathan Lynch March 23, 2023, 1:40 p.m. UTC | #3
Michael Ellerman <mpe@ellerman.id.au> writes:
> Nathan Lynch via B4 Relay <devnull+nathanl.linux.ibm.com@kernel.org> writes:
>> From: Nathan Lynch <nathanl@linux.ibm.com>
>>
>> The kernel can handle retrying RTAS function calls in response to
>> -2/990x in the sys_rtas() handler instead of relaying the intermediate
>> status to user space.
>
> This looks good in general.
>
> One query ...
>
>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> index 47a2aa43d7d4..c330a22ccc70 100644
>> --- a/arch/powerpc/kernel/rtas.c
>> +++ b/arch/powerpc/kernel/rtas.c
>> @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs,
>>  /* We assume to be passed big endian arguments */
>>  SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>>  {
>> -	struct pin_cookie cookie;
>>  	struct rtas_args args;
>>  	unsigned long flags;
>>  	char *buff_copy, *errbuf = NULL;
>> @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>>  
>>  	buff_copy = get_errorlog_buffer();
>>  
>> -	raw_spin_lock_irqsave(&rtas_lock, flags);
>> -	cookie = lockdep_pin_lock(&rtas_lock);
>> +	do {
>> +		struct pin_cookie cookie;
>>  
>> -	rtas_args = args;
>> -	do_enter_rtas(&rtas_args);
>> -	args = rtas_args;
>> +		raw_spin_lock_irqsave(&rtas_lock, flags);
>> +		cookie = lockdep_pin_lock(&rtas_lock);
>>  
>> -	/* A -1 return code indicates that the last command couldn't
>> -	   be completed due to a hardware error. */
>> -	if (be32_to_cpu(args.rets[0]) == -1)
>> -		errbuf = __fetch_rtas_last_error(buff_copy);
>> +		rtas_args = args;
>> +		do_enter_rtas(&rtas_args);
>> +		args = rtas_args;
>>  
>> -	lockdep_unpin_lock(&rtas_lock, cookie);
>> -	raw_spin_unlock_irqrestore(&rtas_lock, flags);
>> +		/*
>> +		 * Handle error record retrieval before releasing the lock.
>> +		 */
>> +		if (be32_to_cpu(args.rets[0]) == -1)
>> +			errbuf = __fetch_rtas_last_error(buff_copy);
>> +
>> +		lockdep_unpin_lock(&rtas_lock, cookie);
>> +		raw_spin_unlock_irqrestore(&rtas_lock, flags);
>> +	} while (rtas_busy_delay(be32_to_cpu(args.rets[0])));
>
> rtas_busy_delay_early() has the successive_ext_delays case that will
> break out eventually. But if we keep getting plain RTAS_BUSY back from
> RTAS I *think* this loop will never terminate?

Yes, but if this happens, then there is a serious bug in Linux or
RTAS. The only time I've seen something like that on PowerVM is when
Linux corrupted internal RTAS state by not serializing calls correctly.

rtas_busy_delay_early() has a bail-out heuristic, not for RTAS_BUSY, but
for extended delay statuses (990x), which I suspect happen rarely (if
ever) that early. That's there in order to allow boot to proceed and
hopefully get useful messages out in a truly unexpected circumstance.

That said...

> To avoid that, and just as good manners, I think we should have a
> fatal_signal_pending() check, and if that returns true we bail out of
> the syscall with -EINTR ?

That probably makes sense. In its current state, I could see
this patch preventing or delaying OS shutdown in situations where it
wouldn't have occurred before.

I think I would want the bailout condition in this case to be
(fatal_signal_pending() && retries > some_threshold), to reduce the
likelihood of non-"stuck" operations from being left unfinished. And it
should dump a stack trace.
Nathan Lynch March 23, 2023, 7:39 p.m. UTC | #4
Andrew Donnellan <ajd@linux.ibm.com> writes:

> On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
>> * We can expect performance improvements for existing sys_rtas()
>>   users, not only because of overall reduction in the number of
>> system
>>   calls issued, but also due to the better handling of -2/990x in the
>>   kernel. For example, librtas still sleeps for 1ms on -2, which is
>>   completely unnecessary.
>
> Would be good to see this fixed on the librtas side.

Filed an issue: https://github.com/ibm-power-utilities/librtas/issues/30
Christophe Leroy Jan. 25, 2024, 3:55 p.m. UTC | #5
Hi Nathan,

Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit :
> From: Nathan Lynch <nathanl@linux.ibm.com>
> 
> The kernel can handle retrying RTAS function calls in response to
> -2/990x in the sys_rtas() handler instead of relaying the intermediate
> status to user space.

 From this series with still have patches 5, 7 and 8 awaiting in 
patchwork, see 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747 
and patch 8 doesn't apply anymore.

Are those 3 patches still relevant or should they be discarded ?

Thanks
Christophe


> 
> Justifications:
> 
> * Currently it's nondeterministic and quite variable in practice
>    whether a retry status is returned for any given invocation of
>    sys_rtas(). Therefore user space code cannot be expecting a retry
>    result without already being broken.
> 
> * This tends to significantly reduce the total number of system calls
>    issued by programs such as drmgr which make use of sys_rtas(),
>    improving the experience of tracing and debugging such
>    programs. This is the main motivation for me: I think this change
>    will make it easier for us to characterize current sys_rtas() use
>    cases as we move them to other interfaces over time.
> 
> * It reduces the number of opportunities for user space to leave
>    complex operations, such as those associated with DLPAR, incomplete
>    and diffcult to recover.
> 
> * We can expect performance improvements for existing sys_rtas()
>    users, not only because of overall reduction in the number of system
>    calls issued, but also due to the better handling of -2/990x in the
>    kernel. For example, librtas still sleeps for 1ms on -2, which is
>    completely unnecessary.
> 
> Performance differences for PHB add and remove on a small P10 PowerVM
> partition are included below. For add, elapsed time is slightly
> reduced. For remove, there are more significant improvements: the
> number of context switches is reduced by an order of magnitude, and
> elapsed time is reduced by over half.
> 
> (- before, + after):
> 
>    Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs):
> 
> -          1,847.58 msec task-clock                       #    0.135 CPUs utilized               ( +- 14.15% )
> -            10,867      cs                               #    9.800 K/sec                       ( +- 14.14% )
> +          1,901.15 msec task-clock                       #    0.148 CPUs utilized               ( +- 14.13% )
> +            10,451      cs                               #    9.158 K/sec                       ( +- 14.14% )
> 
> -         13.656557 +- 0.000124 seconds time elapsed  ( +-  0.00% )
> +          12.88080 +- 0.00404 seconds time elapsed  ( +-  0.03% )
> 
>    Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs):
> 
> -          1,473.75 msec task-clock                       #    0.092 CPUs utilized               ( +- 14.15% )
> -             2,652      cs                               #    3.000 K/sec                       ( +- 14.16% )
> +          1,444.55 msec task-clock                       #    0.221 CPUs utilized               ( +- 14.14% )
> +               104      cs                               #  119.957 /sec                        ( +- 14.63% )
> 
> -          15.99718 +- 0.00801 seconds time elapsed  ( +-  0.05% )
> +           6.54256 +- 0.00830 seconds time elapsed  ( +-  0.13% )
> 
> Move the existing rtas_lock-guarded critical section in sys_rtas()
> into a conventional rtas_busy_delay()-based loop, returning to user
> space only when a final success or failure result is available.
> 
> Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
> ---
>   arch/powerpc/kernel/rtas.c | 28 ++++++++++++++++------------
>   1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 47a2aa43d7d4..c330a22ccc70 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs,
>   /* We assume to be passed big endian arguments */
>   SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>   {
> -	struct pin_cookie cookie;
>   	struct rtas_args args;
>   	unsigned long flags;
>   	char *buff_copy, *errbuf = NULL;
> @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>   
>   	buff_copy = get_errorlog_buffer();
>   
> -	raw_spin_lock_irqsave(&rtas_lock, flags);
> -	cookie = lockdep_pin_lock(&rtas_lock);
> +	do {
> +		struct pin_cookie cookie;
>   
> -	rtas_args = args;
> -	do_enter_rtas(&rtas_args);
> -	args = rtas_args;
> +		raw_spin_lock_irqsave(&rtas_lock, flags);
> +		cookie = lockdep_pin_lock(&rtas_lock);
>   
> -	/* A -1 return code indicates that the last command couldn't
> -	   be completed due to a hardware error. */
> -	if (be32_to_cpu(args.rets[0]) == -1)
> -		errbuf = __fetch_rtas_last_error(buff_copy);
> +		rtas_args = args;
> +		do_enter_rtas(&rtas_args);
> +		args = rtas_args;
>   
> -	lockdep_unpin_lock(&rtas_lock, cookie);
> -	raw_spin_unlock_irqrestore(&rtas_lock, flags);
> +		/*
> +		 * Handle error record retrieval before releasing the lock.
> +		 */
> +		if (be32_to_cpu(args.rets[0]) == -1)
> +			errbuf = __fetch_rtas_last_error(buff_copy);
> +
> +		lockdep_unpin_lock(&rtas_lock, cookie);
> +		raw_spin_unlock_irqrestore(&rtas_lock, flags);
> +	} while (rtas_busy_delay(be32_to_cpu(args.rets[0])));
>   
>   	if (buff_copy) {
>   		if (errbuf)
>
Nathan Lynch Jan. 25, 2024, 4:33 p.m. UTC | #6
Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Hi Nathan,
>
> Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit :
>> From: Nathan Lynch <nathanl@linux.ibm.com>
>> 
>> The kernel can handle retrying RTAS function calls in response to
>> -2/990x in the sys_rtas() handler instead of relaying the intermediate
>> status to user space.
>
>  From this series with still have patches 5, 7 and 8 awaiting in 
> patchwork, see 
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747 
> and patch 8 doesn't apply anymore.
>
> Are those 3 patches still relevant or should they be discarded ?

Thanks for checking - 5 and 7 can be discarded.

I intend to return to 8/8 ("consume retry statuses...") when time
allows. So that could be put in "changes requested" state I suppose, but
if it's easier on the maintainer side to discard it that's fine with me
too.
Christophe Leroy Jan. 25, 2024, 4:46 p.m. UTC | #7
Le 25/01/2024 à 17:33, Nathan Lynch a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> Hi Nathan,
>>
>> Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit :
>>> From: Nathan Lynch <nathanl@linux.ibm.com>
>>>
>>> The kernel can handle retrying RTAS function calls in response to
>>> -2/990x in the sys_rtas() handler instead of relaying the intermediate
>>> status to user space.
>>
>>   From this series with still have patches 5, 7 and 8 awaiting in
>> patchwork, see
>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747
>> and patch 8 doesn't apply anymore.
>>
>> Are those 3 patches still relevant or should they be discarded ?
> 
> Thanks for checking - 5 and 7 can be discarded.
> 
> I intend to return to 8/8 ("consume retry statuses...") when time
> allows. So that could be put in "changes requested" state I suppose, but
> if it's easier on the maintainer side to discard it that's fine with me
> too.

Ok, thanks for answering, I rejected 5 and 7 and requested changes to 8.

While at it, what about patch 2 of series 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=382195&state=*

Christophe
Nathan Lynch Jan. 25, 2024, 5:23 p.m. UTC | #8
Christophe Leroy <christophe.leroy@csgroup.eu> writes:

> Le 25/01/2024 à 17:33, Nathan Lynch a écrit :
>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>> Hi Nathan,
>>>
>>> Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit :
>>>> From: Nathan Lynch <nathanl@linux.ibm.com>
>>>>
>>>> The kernel can handle retrying RTAS function calls in response to
>>>> -2/990x in the sys_rtas() handler instead of relaying the intermediate
>>>> status to user space.
>>>
>>>   From this series with still have patches 5, 7 and 8 awaiting in
>>> patchwork, see
>>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747
>>> and patch 8 doesn't apply anymore.
>>>
>>> Are those 3 patches still relevant or should they be discarded ?
>> 
>> Thanks for checking - 5 and 7 can be discarded.
>> 
>> I intend to return to 8/8 ("consume retry statuses...") when time
>> allows. So that could be put in "changes requested" state I suppose, but
>> if it's easier on the maintainer side to discard it that's fine with me
>> too.
>
> Ok, thanks for answering, I rejected 5 and 7 and requested changes to 8.
>
> While at it, what about patch 2 of series 
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=382195&state=*

Drop it, thanks.
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 47a2aa43d7d4..c330a22ccc70 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1798,7 +1798,6 @@  static bool block_rtas_call(int token, int nargs,
 /* We assume to be passed big endian arguments */
 SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 {
-	struct pin_cookie cookie;
 	struct rtas_args args;
 	unsigned long flags;
 	char *buff_copy, *errbuf = NULL;
@@ -1866,20 +1865,25 @@  SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 
 	buff_copy = get_errorlog_buffer();
 
-	raw_spin_lock_irqsave(&rtas_lock, flags);
-	cookie = lockdep_pin_lock(&rtas_lock);
+	do {
+		struct pin_cookie cookie;
 
-	rtas_args = args;
-	do_enter_rtas(&rtas_args);
-	args = rtas_args;
+		raw_spin_lock_irqsave(&rtas_lock, flags);
+		cookie = lockdep_pin_lock(&rtas_lock);
 
-	/* A -1 return code indicates that the last command couldn't
-	   be completed due to a hardware error. */
-	if (be32_to_cpu(args.rets[0]) == -1)
-		errbuf = __fetch_rtas_last_error(buff_copy);
+		rtas_args = args;
+		do_enter_rtas(&rtas_args);
+		args = rtas_args;
 
-	lockdep_unpin_lock(&rtas_lock, cookie);
-	raw_spin_unlock_irqrestore(&rtas_lock, flags);
+		/*
+		 * Handle error record retrieval before releasing the lock.
+		 */
+		if (be32_to_cpu(args.rets[0]) == -1)
+			errbuf = __fetch_rtas_last_error(buff_copy);
+
+		lockdep_unpin_lock(&rtas_lock, cookie);
+		raw_spin_unlock_irqrestore(&rtas_lock, flags);
+	} while (rtas_busy_delay(be32_to_cpu(args.rets[0])));
 
 	if (buff_copy) {
 		if (errbuf)