[v4] powerpc/pseries/vas: Use usleep_range() to support HCALL delay

Message ID 20231227083401.2307526-1-haren@linux.ibm.com (mailing list archive)
State Superseded

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 23 jobs.

Commit Message

Haren Myneni Dec. 27, 2023, 8:34 a.m. UTC
VAS allocate, modify and deallocate HCALLs return
H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy
delay and expect the OS to reissue the HCALL after that delay. But
msleep() will often sleep at least 20 msecs even though the
hypervisor suggests that the OS reissue these HCALLs after 1 or
10 msecs.

The open and close VAS window functions hold a mutex while issuing
these HCALLs, so these operations can take longer than necessary
when multiple threads call the open or close window APIs
simultaneously. This might especially affect performance when the
open/close APIs are repeated for each compression request. On a
large machine configuration that allows more simultaneous
open/close windows (e.g. 240 cores provide 4800 VAS credits), the
user can observe hung task traces in dmesg due to mutex contention
around the open/close HCALLs.

So instead of msleep(), use usleep_range() so that the sleep stays
close to the expected value before the HCALL is issued again.

Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>

---
v1 -> v2:
- Use usleep_range instead of using RTAS sleep routine as
  suggested by Nathan
v2 -> v3:
- Sleep 10 msecs even for HCALL delay > 10 msecs and the other
  commit / comment changes as suggested by Nathan and Ellerman.
v3 -> v4:
- More description in the commit log with the visible impact for
  the current code as suggested by Aneesh
---
 arch/powerpc/platforms/pseries/vas.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

Comments

Aneesh Kumar K.V Jan. 1, 2024, 6:16 a.m. UTC | #1
Haren Myneni <haren@linux.ibm.com> writes:

> VAS allocate, modify and deallocate HCALLs return
> H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy
> delay and expect the OS to reissue the HCALL after that delay. But
> msleep() will often sleep at least 20 msecs even though the
> hypervisor suggests that the OS reissue these HCALLs after 1 or
> 10 msecs.
>
> The open and close VAS window functions hold a mutex while issuing
> these HCALLs, so these operations can take longer than necessary
> when multiple threads call the open or close window APIs
> simultaneously. This might especially affect performance when the
> open/close APIs are repeated for each compression request. On a
> large machine configuration that allows more simultaneous
> open/close windows (e.g. 240 cores provide 4800 VAS credits), the
> user can observe hung task traces in dmesg due to mutex contention
> around the open/close HCALLs.
>
> So instead of msleep(), use usleep_range() so that the sleep stays
> close to the expected value before the HCALL is issued again.
>
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
>
> ---
> v1 -> v2:
> - Use usleep_range instead of using RTAS sleep routine as
>   suggested by Nathan
> v2 -> v3:
> - Sleep 10 msecs even for HCALL delay > 10 msecs and the other
>   commit / comment changes as suggested by Nathan and Ellerman.
> v3 -> v4:
> - More description in the commit log with the visible impact for
>   the current code as suggested by Aneesh
> ---
>  arch/powerpc/platforms/pseries/vas.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
> index 71d52a670d95..5cf81c564d4b 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -38,7 +38,30 @@ static long hcall_return_busy_check(long rc)
>  {
>  	/* Check if we are stalled for some time */
>  	if (H_IS_LONG_BUSY(rc)) {
> -		msleep(get_longbusy_msecs(rc));
> +		unsigned int ms;
> +		/*
> +		 * Allocate, Modify and Deallocate HCALLs returns
> +		 * H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC
> +		 * for the long delay. So the sleep time should always
> +		 * be either 1 or 10msecs, but in case if the HCALL
> +		 * returns the long delay > 10 msecs, clamp the sleep
> +		 * time to 10msecs.
> +		 */
> +		ms = clamp(get_longbusy_msecs(rc), 1, 10);
> +
> +		/*
> +		 * msleep() will often sleep at least 20 msecs even
> +		 * though the hypervisor suggests that the OS reissue
> +		 * HCALLs after 1 or 10msecs. Also the delay hint from
> +		 * the HCALL is just a suggestion. So OK to pause for
> +		 * less time than the hinted delay. Use usleep_range()
> +		 * to ensure we don't sleep much longer than actually
> +		 * needed.
> +		 *
> +		 * See Documentation/timers/timers-howto.rst for
> +		 * explanation of the range used here.
> +		 */
> +		usleep_range(ms * 100, ms * 1000);
>

Are there more details on this range (ms * 100, ms * 1000)?

Can we use USEC_PER_MSEC instead of 1000?

>  		rc = H_BUSY;
>  	} else if (rc == H_BUSY) {
>  		cond_resched();


It would be good to convert this to a helper and switch rtas_busy_delay
to use this new helper. One question though is w.r.t the clamp values.
Does that need to be specific to each hcall? Can we make it generic?

rtas_busy_delay() explicitly checks for 20msec. Any reason to do that?
timers-howto.rst suggests using msleep for > 10msec.

if (ms <= 20)
	usleep_range(ms * 100, ms * 1000);
else
	msleep(ms);

Nathan Lynch Jan. 2, 2024, 3:16 p.m. UTC | #2
"Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes:
> Haren Myneni <haren@linux.ibm.com> writes:
>
>> diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
>> index 71d52a670d95..5cf81c564d4b 100644
>> --- a/arch/powerpc/platforms/pseries/vas.c
>> +++ b/arch/powerpc/platforms/pseries/vas.c
>> @@ -38,7 +38,30 @@ static long hcall_return_busy_check(long rc)
>>  {
>>  	/* Check if we are stalled for some time */
>>  	if (H_IS_LONG_BUSY(rc)) {
>> -		msleep(get_longbusy_msecs(rc));
>> +		unsigned int ms;
>> +		/*
>> +		 * Allocate, Modify and Deallocate HCALLs returns
>> +		 * H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC
>> +		 * for the long delay. So the sleep time should always
>> +		 * be either 1 or 10msecs, but in case if the HCALL
>> +		 * returns the long delay > 10 msecs, clamp the sleep
>> +		 * time to 10msecs.
>> +		 */
>> +		ms = clamp(get_longbusy_msecs(rc), 1, 10);
>> +
>> +		/*
>> +		 * msleep() will often sleep at least 20 msecs even
>> +		 * though the hypervisor suggests that the OS reissue
>> +		 * HCALLs after 1 or 10msecs. Also the delay hint from
>> +		 * the HCALL is just a suggestion. So OK to pause for
>> +		 * less time than the hinted delay. Use usleep_range()
>> +		 * to ensure we don't sleep much longer than actually
>> +		 * needed.
>> +		 *
>> +		 * See Documentation/timers/timers-howto.rst for
>> +		 * explanation of the range used here.
>> +		 */
>> +		usleep_range(ms * 100, ms * 1000);
>>
>
> Are there more details on this range (ms * 100, ms * 1000)?

The preceding comment ("see Documentation/timers/timers-howto...")
should be removed; that document does not really explain this range
directly.

What timers-howto does say is that the larger a range you provide, the
less likely you are to trigger an interrupt to wake up. Since we know
that retrying "too soon" is harmless, providing a lower bound an order
of magnitude less than the suggested delay (which forms the upper bound)
seems reasonable.
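
To make the arithmetic concrete, here is a minimal user-space sketch of
how the two bounds passed to usleep_range() follow from the hinted
delay. The helper busy_delay_range() and struct sleep_range_us are
hypothetical names used only for illustration; they are not part of the
patch. USEC_PER_MSEC is redefined locally so the sketch builds outside
the kernel tree, and the lower bound is written as USEC_PER_MSEC / 10,
which is the same value as the ms * 100 in the patch.

```c
#include <assert.h>

/* Local stand-in for the kernel's USEC_PER_MSEC (assumption: 1000). */
#define USEC_PER_MSEC 1000U

/* Hypothetical container for the bounds handed to usleep_range(). */
struct sleep_range_us {
	unsigned int min_us;
	unsigned int max_us;
};

/*
 * Hypothetical helper: clamp the hinted delay to 1..10 msecs as the
 * patch does, then derive a range whose upper bound is the hint and
 * whose lower bound is one order of magnitude smaller.
 */
static struct sleep_range_us busy_delay_range(unsigned int hint_ms)
{
	struct sleep_range_us r;
	unsigned int ms = hint_ms;

	if (ms < 1)
		ms = 1;
	if (ms > 10)
		ms = 10;

	r.min_us = ms * (USEC_PER_MSEC / 10);
	r.max_us = ms * USEC_PER_MSEC;

	/* The range must never be inverted. */
	assert(r.min_us <= r.max_us);
	return r;
}
```

With these definitions a 1 msec hint gives the range 100..1000 usecs and
a 10 msec hint gives 1000..10000 usecs; any larger hint is clamped to
the 10 msec case.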

>
> Can we use USEC_PER_MSEC instead of 1000?

Agreed.

>>  		rc = H_BUSY;
>>  	} else if (rc == H_BUSY) {
>>  		cond_resched();
>
>
> It would be good to convert this to a helper and switch rtas_busy_delay
> to use this new helper.

I have reservations about that suggestion.

The logic for handling the 990x extended delay constants conceivably
could be shared. But any helper that handles the "retry immediately"
statuses has to know whether it's handling a status from an RTAS call or
an hcall: RTAS_BUSY and H_BUSY have the same semantics but different
values.

Also I don't really want kernel/rtas.c to gain more dependencies on
pseries-specific code as long as there are non-pseries platforms that
use it (chrp, maple, cell).

Tolerating a little duplication here should be OK IMO.
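
As a rough illustration of where the split falls, the user-space sketch
below shares only the decoding of the 990x extended delay statuses and
makes the caller pass in its own "retry immediately" value. All names
are hypothetical. The numeric values (9900..9905 for the extended
delays, 1 for H_BUSY, -2 for RTAS_BUSY) are my recollection of the
kernel headers and should be treated as assumptions, not as copied
definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values; check the real headers before relying on them. */
#define SKETCH_EXT_DELAY_MIN	9900
#define SKETCH_EXT_DELAY_MAX	9905
#define SKETCH_H_BUSY		1
#define SKETCH_RTAS_BUSY	(-2)

/*
 * Shareable part: decode a 990x status into a delay hint in msecs
 * (9900 -> 1, 9901 -> 10, ..., 9905 -> 100000). Returns false when
 * the status is not an extended delay.
 */
static bool sketch_ext_delay_ms(long status, unsigned int *ms)
{
	unsigned int order, delay = 1;

	assert(ms != 0);
	if (status < SKETCH_EXT_DELAY_MIN || status > SKETCH_EXT_DELAY_MAX)
		return false;

	for (order = (unsigned int)(status - SKETCH_EXT_DELAY_MIN);
	     order > 0; order--)
		delay *= 10;

	*ms = delay;
	return true;
}

/*
 * Non-shareable part: the "retry immediately" status differs between
 * hcalls and RTAS calls, so the caller has to say which one it means.
 */
static bool sketch_should_retry(long status, long busy_value)
{
	unsigned int ms;

	return status == busy_value || sketch_ext_delay_ms(status, &ms);
}
```

A status of 1 means "retry" only when the caller passes SKETCH_H_BUSY,
and -2 only when it passes SKETCH_RTAS_BUSY, which is the extra
knowledge a common helper would need.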

> One question though is w.r.t the clamp values.
> Does that need to be specific to each hcall? Can we make it generic?
>
> rtas_busy_delay() explicitly checks for 20msec. Any reason to do that?
> timers-howto.rst suggests using msleep for > 10msec.

I understand it to suggest (roughly) 20ms for the threshold:

        SLEEPING FOR ~USECS OR SMALL MSECS ( 10us - 20ms):
                * Use usleep_range
[...]
                        msleep(1~20) may not do what the caller intends, and
                        will often sleep longer (~20 ms actual sleep for any
                        value given in the 1~20ms range).

20ms is also the threshold used by fsleep().
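
For reference, the threshold logic can be modelled in user space as
below. This is a sketch of how I understand fsleep() to choose between
primitives at the time of this thread (udelay() up to 10 usecs,
usleep_range() up to 20 msecs, msleep() beyond that); the enum and the
function names are hypothetical, and the sketch only reports which
primitive would be picked rather than sleeping.

```c
#include <assert.h>

/* Hypothetical labels for the three kernel sleep/delay primitives. */
enum sleep_primitive {
	USE_UDELAY,
	USE_USLEEP_RANGE,
	USE_MSLEEP,
};

/* Model of the thresholds: <= 10us, <= 20ms, everything longer. */
static enum sleep_primitive pick_sleep_primitive(unsigned long usecs)
{
	if (usecs <= 10)
		return USE_UDELAY;
	if (usecs <= 20000)
		return USE_USLEEP_RANGE;
	return USE_MSLEEP;
}

/*
 * Usage: both delays the VAS HCALLs are expected to return (1 and
 * 10 msecs) land in the usleep_range() band.
 */
static void check_vas_delays(void)
{
	assert(pick_sleep_primitive(1 * 1000UL) == USE_USLEEP_RANGE);
	assert(pick_sleep_primitive(10 * 1000UL) == USE_USLEEP_RANGE);
}
```

Since the patch clamps the delay to at most 10 msecs, it never reaches
the band where msleep() would be the better choice.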

Patch

diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 71d52a670d95..5cf81c564d4b 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -38,7 +38,30 @@  static long hcall_return_busy_check(long rc)
 {
 	/* Check if we are stalled for some time */
 	if (H_IS_LONG_BUSY(rc)) {
-		msleep(get_longbusy_msecs(rc));
+		unsigned int ms;
+		/*
+		 * Allocate, Modify and Deallocate HCALLs returns
+		 * H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC
+		 * for the long delay. So the sleep time should always
+		 * be either 1 or 10msecs, but in case if the HCALL
+		 * returns the long delay > 10 msecs, clamp the sleep
+		 * time to 10msecs.
+		 */
+		ms = clamp(get_longbusy_msecs(rc), 1, 10);
+
+		/*
+		 * msleep() will often sleep at least 20 msecs even
+		 * though the hypervisor suggests that the OS reissue
+		 * HCALLs after 1 or 10msecs. Also the delay hint from
+		 * the HCALL is just a suggestion. So OK to pause for
+		 * less time than the hinted delay. Use usleep_range()
+		 * to ensure we don't sleep much longer than actually
+		 * needed.
+		 *
+		 * See Documentation/timers/timers-howto.rst for
+		 * explanation of the range used here.
+		 */
+		usleep_range(ms * 100, ms * 1000);
 		rc = H_BUSY;
 	} else if (rc == H_BUSY) {
 		cond_resched();