Message ID | 20230220-rtas-queue-for-6-4-v1-8-010e4416f13f@linux.ibm.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | RTAS changes for 6.4 | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/github-powerpc_ppctests | success | Successfully ran 8 jobs. |
snowpatch_ozlabs/github-powerpc_selftests | success | Successfully ran 8 jobs. |
snowpatch_ozlabs/github-powerpc_clang | success | Successfully ran 6 jobs. |
snowpatch_ozlabs/github-powerpc_sparse | success | Successfully ran 4 jobs. |
snowpatch_ozlabs/github-powerpc_kernel_qemu | success | Successfully ran 24 jobs. |
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote: > From: Nathan Lynch <nathanl@linux.ibm.com> > > The kernel can handle retrying RTAS function calls in response to > -2/990x in the sys_rtas() handler instead of relaying the > intermediate > status to user space. > > Justifications: > > * Currently it's nondeterministic and quite variable in practice > whether a retry status is returned for any given invocation of > sys_rtas(). Therefore user space code cannot be expecting a retry > result without already being broken. > > * This tends to significantly reduce the total number of system calls > issued by programs such as drmgr which make use of sys_rtas(), > improving the experience of tracing and debugging such > programs. This is the main motivation for me: I think this change > will make it easier for us to characterize current sys_rtas() use > cases as we move them to other interfaces over time. > > * It reduces the number of opportunities for user space to leave > complex operations, such as those associated with DLPAR, incomplete > and diffcult to recover. > > * We can expect performance improvements for existing sys_rtas() > users, not only because of overall reduction in the number of > system > calls issued, but also due to the better handling of -2/990x in the > kernel. For example, librtas still sleeps for 1ms on -2, which is > completely unnecessary. Would be good to see this fixed on the librtas side. > > Performance differences for PHB add and remove on a small P10 PowerVM > partition are included below. For add, elapsed time is slightly > reduced. For remove, there are more significant improvements: the > number of context switches is reduced by an order of magnitude, and > elapsed time is reduced by over half. > > (- before, + after): > > Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs): > > - 1,847.58 msec task-clock # 0.135 > CPUs utilized ( +- 14.15% ) > - 10,867 cs # 9.800 > K/sec ( +- 14.14% ) > + 1,901.15 msec task-clock # 0.148 > CPUs utilized ( +- 14.13% ) > + 10,451 cs # 9.158 > K/sec ( +- 14.14% ) > > - 13.656557 +- 0.000124 seconds time elapsed ( +- 0.00% ) > + 12.88080 +- 0.00404 seconds time elapsed ( +- 0.03% ) > > Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs): > > - 1,473.75 msec task-clock # 0.092 > CPUs utilized ( +- 14.15% ) > - 2,652 cs # 3.000 > K/sec ( +- 14.16% ) > + 1,444.55 msec task-clock # 0.221 > CPUs utilized ( +- 14.14% ) > + 104 cs # 119.957 > /sec ( +- 14.63% ) > > - 15.99718 +- 0.00801 seconds time elapsed ( +- 0.05% ) > + 6.54256 +- 0.00830 seconds time elapsed ( +- 0.13% ) > > Move the existing rtas_lock-guarded critical section in sys_rtas() > into a conventional rtas_busy_delay()-based loop, returning to user > space only when a final success or failure result is available. > > Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Should there be some kind of timeout? I'm a bit worried by sleeping in a syscall for an extended period.
Nathan Lynch via B4 Relay <devnull+nathanl.linux.ibm.com@kernel.org> writes: > From: Nathan Lynch <nathanl@linux.ibm.com> > > The kernel can handle retrying RTAS function calls in response to > -2/990x in the sys_rtas() handler instead of relaying the intermediate > status to user space. This looks good in general. One query ... > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > index 47a2aa43d7d4..c330a22ccc70 100644 > --- a/arch/powerpc/kernel/rtas.c > +++ b/arch/powerpc/kernel/rtas.c > @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs, > /* We assume to be passed big endian arguments */ > SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) > { > - struct pin_cookie cookie; > struct rtas_args args; > unsigned long flags; > char *buff_copy, *errbuf = NULL; > @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) > > buff_copy = get_errorlog_buffer(); > > - raw_spin_lock_irqsave(&rtas_lock, flags); > - cookie = lockdep_pin_lock(&rtas_lock); > + do { > + struct pin_cookie cookie; > > - rtas_args = args; > - do_enter_rtas(&rtas_args); > - args = rtas_args; > + raw_spin_lock_irqsave(&rtas_lock, flags); > + cookie = lockdep_pin_lock(&rtas_lock); > > - /* A -1 return code indicates that the last command couldn't > - be completed due to a hardware error. */ > - if (be32_to_cpu(args.rets[0]) == -1) > - errbuf = __fetch_rtas_last_error(buff_copy); > + rtas_args = args; > + do_enter_rtas(&rtas_args); > + args = rtas_args; > > - lockdep_unpin_lock(&rtas_lock, cookie); > - raw_spin_unlock_irqrestore(&rtas_lock, flags); > + /* > + * Handle error record retrieval before releasing the lock. > + */ > + if (be32_to_cpu(args.rets[0]) == -1) > + errbuf = __fetch_rtas_last_error(buff_copy); > + > + lockdep_unpin_lock(&rtas_lock, cookie); > + raw_spin_unlock_irqrestore(&rtas_lock, flags); > + } while (rtas_busy_delay(be32_to_cpu(args.rets[0]))); rtas_busy_delay_early() has the successive_ext_delays case that will break out eventually. But if we keep getting plain RTAS_BUSY back from RTAS I *think* this loop will never terminate? To avoid that, and just as good manners, I think we should have a fatal_signal_pending() check, and if that returns true we bail out of the syscall with -EINTR ? cheers
Michael Ellerman <mpe@ellerman.id.au> writes: > Nathan Lynch via B4 Relay <devnull+nathanl.linux.ibm.com@kernel.org> writes: >> From: Nathan Lynch <nathanl@linux.ibm.com> >> >> The kernel can handle retrying RTAS function calls in response to >> -2/990x in the sys_rtas() handler instead of relaying the intermediate >> status to user space. > > This looks good in general. > > One query ... > >> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c >> index 47a2aa43d7d4..c330a22ccc70 100644 >> --- a/arch/powerpc/kernel/rtas.c >> +++ b/arch/powerpc/kernel/rtas.c >> @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs, >> /* We assume to be passed big endian arguments */ >> SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) >> { >> - struct pin_cookie cookie; >> struct rtas_args args; >> unsigned long flags; >> char *buff_copy, *errbuf = NULL; >> @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) >> >> buff_copy = get_errorlog_buffer(); >> >> - raw_spin_lock_irqsave(&rtas_lock, flags); >> - cookie = lockdep_pin_lock(&rtas_lock); >> + do { >> + struct pin_cookie cookie; >> >> - rtas_args = args; >> - do_enter_rtas(&rtas_args); >> - args = rtas_args; >> + raw_spin_lock_irqsave(&rtas_lock, flags); >> + cookie = lockdep_pin_lock(&rtas_lock); >> >> - /* A -1 return code indicates that the last command couldn't >> - be completed due to a hardware error. */ >> - if (be32_to_cpu(args.rets[0]) == -1) >> - errbuf = __fetch_rtas_last_error(buff_copy); >> + rtas_args = args; >> + do_enter_rtas(&rtas_args); >> + args = rtas_args; >> >> - lockdep_unpin_lock(&rtas_lock, cookie); >> - raw_spin_unlock_irqrestore(&rtas_lock, flags); >> + /* >> + * Handle error record retrieval before releasing the lock. >> + */ >> + if (be32_to_cpu(args.rets[0]) == -1) >> + errbuf = __fetch_rtas_last_error(buff_copy); >> + >> + lockdep_unpin_lock(&rtas_lock, cookie); >> + raw_spin_unlock_irqrestore(&rtas_lock, flags); >> + } while (rtas_busy_delay(be32_to_cpu(args.rets[0]))); > > rtas_busy_delay_early() has the successive_ext_delays case that will > break out eventually. But if we keep getting plain RTAS_BUSY back from > RTAS I *think* this loop will never terminate? Yes, but if this happens, then there is a serious bug in Linux or RTAS. The only time I've seen something like that on PowerVM is when Linux corrupted internal RTAS state by not serializing calls correctly. rtas_busy_delay_early() has a bail-out heuristic, not for RTAS_BUSY, but for extended delay statuses (990x), which I suspect happen rarely (if ever) that early. That's there in order to allow boot to proceed and hopefully get useful messages out in a truly unexpected circumstance. That said... > To avoid that, and just as good manners, I think we should have a > fatal_signal_pending() check, and if that returns true we bail out of > the syscall with -EINTR ? That probably makes sense. In its current state, I could see this patch preventing or delaying OS shutdown in situations where it wouldn't have occurred before. I think I would want the bailout condition in this case to be (fatal_signal_pending() && retries > some_threshold), to reduce the likelihood of non-"stuck" operations from being left unfinished. And it should dump a stack trace.
Andrew Donnellan <ajd@linux.ibm.com> writes: > On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote: >> * We can expect performance improvements for existing sys_rtas() >> users, not only because of overall reduction in the number of >> system >> calls issued, but also due to the better handling of -2/990x in the >> kernel. For example, librtas still sleeps for 1ms on -2, which is >> completely unnecessary. > > Would be good to see this fixed on the librtas side. Filed an issue: https://github.com/ibm-power-utilities/librtas/issues/30
Hi Nathan, Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit : > From: Nathan Lynch <nathanl@linux.ibm.com> > > The kernel can handle retrying RTAS function calls in response to > -2/990x in the sys_rtas() handler instead of relaying the intermediate > status to user space. From this series with still have patches 5, 7 and 8 awaiting in patchwork, see https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747 and patch 8 doesn't apply anymore. Are those 3 patches still relevant or should they be discarded ? Thanks Christophe > > Justifications: > > * Currently it's nondeterministic and quite variable in practice > whether a retry status is returned for any given invocation of > sys_rtas(). Therefore user space code cannot be expecting a retry > result without already being broken. > > * This tends to significantly reduce the total number of system calls > issued by programs such as drmgr which make use of sys_rtas(), > improving the experience of tracing and debugging such > programs. This is the main motivation for me: I think this change > will make it easier for us to characterize current sys_rtas() use > cases as we move them to other interfaces over time. > > * It reduces the number of opportunities for user space to leave > complex operations, such as those associated with DLPAR, incomplete > and diffcult to recover. > > * We can expect performance improvements for existing sys_rtas() > users, not only because of overall reduction in the number of system > calls issued, but also due to the better handling of -2/990x in the > kernel. For example, librtas still sleeps for 1ms on -2, which is > completely unnecessary. > > Performance differences for PHB add and remove on a small P10 PowerVM > partition are included below. For add, elapsed time is slightly > reduced. For remove, there are more significant improvements: the > number of context switches is reduced by an order of magnitude, and > elapsed time is reduced by over half. > > (- before, + after): > > Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs): > > - 1,847.58 msec task-clock # 0.135 CPUs utilized ( +- 14.15% ) > - 10,867 cs # 9.800 K/sec ( +- 14.14% ) > + 1,901.15 msec task-clock # 0.148 CPUs utilized ( +- 14.13% ) > + 10,451 cs # 9.158 K/sec ( +- 14.14% ) > > - 13.656557 +- 0.000124 seconds time elapsed ( +- 0.00% ) > + 12.88080 +- 0.00404 seconds time elapsed ( +- 0.03% ) > > Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs): > > - 1,473.75 msec task-clock # 0.092 CPUs utilized ( +- 14.15% ) > - 2,652 cs # 3.000 K/sec ( +- 14.16% ) > + 1,444.55 msec task-clock # 0.221 CPUs utilized ( +- 14.14% ) > + 104 cs # 119.957 /sec ( +- 14.63% ) > > - 15.99718 +- 0.00801 seconds time elapsed ( +- 0.05% ) > + 6.54256 +- 0.00830 seconds time elapsed ( +- 0.13% ) > > Move the existing rtas_lock-guarded critical section in sys_rtas() > into a conventional rtas_busy_delay()-based loop, returning to user > space only when a final success or failure result is available. > > Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> > --- > arch/powerpc/kernel/rtas.c | 28 ++++++++++++++++------------ > 1 file changed, 16 insertions(+), 12 deletions(-) > > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > index 47a2aa43d7d4..c330a22ccc70 100644 > --- a/arch/powerpc/kernel/rtas.c > +++ b/arch/powerpc/kernel/rtas.c > @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs, > /* We assume to be passed big endian arguments */ > SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) > { > - struct pin_cookie cookie; > struct rtas_args args; > unsigned long flags; > char *buff_copy, *errbuf = NULL; > @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) > > buff_copy = get_errorlog_buffer(); > > - raw_spin_lock_irqsave(&rtas_lock, flags); > - cookie = lockdep_pin_lock(&rtas_lock); > + do { > + struct pin_cookie cookie; > > - rtas_args = args; > - do_enter_rtas(&rtas_args); > - args = rtas_args; > + raw_spin_lock_irqsave(&rtas_lock, flags); > + cookie = lockdep_pin_lock(&rtas_lock); > > - /* A -1 return code indicates that the last command couldn't > - be completed due to a hardware error. */ > - if (be32_to_cpu(args.rets[0]) == -1) > - errbuf = __fetch_rtas_last_error(buff_copy); > + rtas_args = args; > + do_enter_rtas(&rtas_args); > + args = rtas_args; > > - lockdep_unpin_lock(&rtas_lock, cookie); > - raw_spin_unlock_irqrestore(&rtas_lock, flags); > + /* > + * Handle error record retrieval before releasing the lock. > + */ > + if (be32_to_cpu(args.rets[0]) == -1) > + errbuf = __fetch_rtas_last_error(buff_copy); > + > + lockdep_unpin_lock(&rtas_lock, cookie); > + raw_spin_unlock_irqrestore(&rtas_lock, flags); > + } while (rtas_busy_delay(be32_to_cpu(args.rets[0]))); > > if (buff_copy) { > if (errbuf) >
Christophe Leroy <christophe.leroy@csgroup.eu> writes: > Hi Nathan, > > Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit : >> From: Nathan Lynch <nathanl@linux.ibm.com> >> >> The kernel can handle retrying RTAS function calls in response to >> -2/990x in the sys_rtas() handler instead of relaying the intermediate >> status to user space. > > From this series with still have patches 5, 7 and 8 awaiting in > patchwork, see > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747 > and patch 8 doesn't apply anymore. > > Are those 3 patches still relevant or should they be discarded ? Thanks for checking - 5 and 7 can be discarded. I intend to return to 8/8 ("consume retry statuses...") when time allows. So that could be put in "changes requested" state I suppose, but if it's easier on the maintainer side to discard it that's fine with me too.
Le 25/01/2024 à 17:33, Nathan Lynch a écrit : > Christophe Leroy <christophe.leroy@csgroup.eu> writes: >> Hi Nathan, >> >> Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit : >>> From: Nathan Lynch <nathanl@linux.ibm.com> >>> >>> The kernel can handle retrying RTAS function calls in response to >>> -2/990x in the sys_rtas() handler instead of relaying the intermediate >>> status to user space. >> >> From this series with still have patches 5, 7 and 8 awaiting in >> patchwork, see >> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747 >> and patch 8 doesn't apply anymore. >> >> Are those 3 patches still relevant or should they be discarded ? > > Thanks for checking - 5 and 7 can be discarded. > > I intend to return to 8/8 ("consume retry statuses...") when time > allows. So that could be put in "changes requested" state I suppose, but > if it's easier on the maintainer side to discard it that's fine with me > too. Ok, thanks for answering, I rejected 5 and 7 and requested changes to 8. While at it, what about patch 2 of series https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=382195&state=* Christophe
Christophe Leroy <christophe.leroy@csgroup.eu> writes: > Le 25/01/2024 à 17:33, Nathan Lynch a écrit : >> Christophe Leroy <christophe.leroy@csgroup.eu> writes: >>> Hi Nathan, >>> >>> Le 06/03/2023 à 22:33, Nathan Lynch via B4 Relay a écrit : >>>> From: Nathan Lynch <nathanl@linux.ibm.com> >>>> >>>> The kernel can handle retrying RTAS function calls in response to >>>> -2/990x in the sys_rtas() handler instead of relaying the intermediate >>>> status to user space. >>> >>> From this series with still have patches 5, 7 and 8 awaiting in >>> patchwork, see >>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=85747 >>> and patch 8 doesn't apply anymore. >>> >>> Are those 3 patches still relevant or should they be discarded ? >> >> Thanks for checking - 5 and 7 can be discarded. >> >> I intend to return to 8/8 ("consume retry statuses...") when time >> allows. So that could be put in "changes requested" state I suppose, but >> if it's easier on the maintainer side to discard it that's fine with me >> too. > > Ok, thanks for answering, I rejected 5 and 7 and requested changes to 8. > > While at it, what about patch 2 of series > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=382195&state=* Drop it, thanks.
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 47a2aa43d7d4..c330a22ccc70 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs, /* We assume to be passed big endian arguments */ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) { - struct pin_cookie cookie; struct rtas_args args; unsigned long flags; char *buff_copy, *errbuf = NULL; @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) buff_copy = get_errorlog_buffer(); - raw_spin_lock_irqsave(&rtas_lock, flags); - cookie = lockdep_pin_lock(&rtas_lock); + do { + struct pin_cookie cookie; - rtas_args = args; - do_enter_rtas(&rtas_args); - args = rtas_args; + raw_spin_lock_irqsave(&rtas_lock, flags); + cookie = lockdep_pin_lock(&rtas_lock); - /* A -1 return code indicates that the last command couldn't - be completed due to a hardware error. */ - if (be32_to_cpu(args.rets[0]) == -1) - errbuf = __fetch_rtas_last_error(buff_copy); + rtas_args = args; + do_enter_rtas(&rtas_args); + args = rtas_args; - lockdep_unpin_lock(&rtas_lock, cookie); - raw_spin_unlock_irqrestore(&rtas_lock, flags); + /* + * Handle error record retrieval before releasing the lock. + */ + if (be32_to_cpu(args.rets[0]) == -1) + errbuf = __fetch_rtas_last_error(buff_copy); + + lockdep_unpin_lock(&rtas_lock, cookie); + raw_spin_unlock_irqrestore(&rtas_lock, flags); + } while (rtas_busy_delay(be32_to_cpu(args.rets[0]))); if (buff_copy) { if (errbuf)