Message ID | 20240419135321.70781-10-ajones@ventanamicro.com |
---|---|
State | Changes Requested |
Headers | show |
Series | riscv: Apply Zawrs when available | expand |
Context | Check | Description |
---|---|---|
robh/checkpatch | warning | total: 0 errors, 1 warnings, 18 lines checked |
robh/patch-applied | success | |
robh/dtbs-check | warning | build log |
robh/dt-meta-schema | success |
On Fri, Apr 19, 2024 at 03:53:24PM +0200, Andrew Jones wrote: > Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension > which was ratified in commit 98918c844281 of riscv-isa-manual. > > Signed-off-by: Andrew Jones <ajones@ventanamicro.com> > --- > .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml > index 468c646247aa..584da2f539e5 100644 > --- a/Documentation/devicetree/bindings/riscv/extensions.yaml > +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml > @@ -177,6 +177,18 @@ properties: > is supported as ratified at commit 5059e0ca641c ("update to > ratified") of the riscv-zacas. > > + - const: zawrs > + description: | > + The Zawrs extension for entering a low-power state or for trapping > + to a hypervisor while waiting on a store to a memory location, as > + ratified in commit 98918c844281 ("Merge pull request #1217 from > + riscv/zawrs") of riscv-isa-manual. This part is fine... > Linux assumes that WRS.NTO will > + either always eventually terminate the stall due to the reservation > + set becoming invalid, implementation-specific other reasons, or > + because a higher privilege level has configured it to cause an > + illegal instruction exception after an implementation-specific > + bounded time limit. ...but I don't like this bit. The binding should just describe what the property means for the hardware, not discuss specifics about a particular OS. And with my dt-bindings hat off and my kernel hat on, I think that if we want to have more specific requirements than the extension provides we either need to a) document that zawrs means that it will always terminate or b) additionally document a "zawrs-always-terminates" that has that meaning and look for it to enable the behaviour. Documenting something and immediately turning around and saying "this isn't sufficient, let's assume it means more than it does" just seems like we should make firmware tell us exactly what we want. Cheers, Conor.
On Fri, Apr 19, 2024 at 03:45:46PM +0100, Conor Dooley wrote: > On Fri, Apr 19, 2024 at 03:53:24PM +0200, Andrew Jones wrote: > > Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension > > which was ratified in commit 98918c844281 of riscv-isa-manual. > > > > Signed-off-by: Andrew Jones <ajones@ventanamicro.com> > > --- > > .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml > > index 468c646247aa..584da2f539e5 100644 > > --- a/Documentation/devicetree/bindings/riscv/extensions.yaml > > +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml > > @@ -177,6 +177,18 @@ properties: > > is supported as ratified at commit 5059e0ca641c ("update to > > ratified") of the riscv-zacas. > > > > + - const: zawrs > > + description: | > > + The Zawrs extension for entering a low-power state or for trapping > > + to a hypervisor while waiting on a store to a memory location, as > > + ratified in commit 98918c844281 ("Merge pull request #1217 from > > + riscv/zawrs") of riscv-isa-manual. > > This part is fine... > > > > Linux assumes that WRS.NTO will > > + either always eventually terminate the stall due to the reservation > > + set becoming invalid, implementation-specific other reasons, or > > + because a higher privilege level has configured it to cause an > > + illegal instruction exception after an implementation-specific > > + bounded time limit. > > ...but I don't like this bit. The binding should just describe what the > property means for the hardware, not discuss specifics about a > particular OS. > > And with my dt-bindings hat off and my kernel hat on, I think that if we > want to have more specific requirements than the extension provides we > either need to a) document that zawrs means that it will always > terminate or b) additionally document a "zawrs-always-terminates" that > has that meaning and look for it to enable the behaviour. IIUC, the text above mostly just needs to remove 'Linux assumes' in order to provide what we want for (a)? I'm not sure about (b). If Zawrs is unusable as is, then we should probably just go back to the specs and get a new standard extension name for a new version which includes the changes we need. Thanks, drew
On Fri, Apr 19, 2024 at 05:16:05PM +0200, Andrew Jones wrote: > On Fri, Apr 19, 2024 at 03:45:46PM +0100, Conor Dooley wrote: > > On Fri, Apr 19, 2024 at 03:53:24PM +0200, Andrew Jones wrote: > > > Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension > > > which was ratified in commit 98918c844281 of riscv-isa-manual. > > > > > > Signed-off-by: Andrew Jones <ajones@ventanamicro.com> > > > --- > > > .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ > > > 1 file changed, 12 insertions(+) > > > > > > diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > index 468c646247aa..584da2f539e5 100644 > > > --- a/Documentation/devicetree/bindings/riscv/extensions.yaml > > > +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > @@ -177,6 +177,18 @@ properties: > > > is supported as ratified at commit 5059e0ca641c ("update to > > > ratified") of the riscv-zacas. > > > > > > + - const: zawrs > > > + description: | > > > + The Zawrs extension for entering a low-power state or for trapping > > > + to a hypervisor while waiting on a store to a memory location, as > > > + ratified in commit 98918c844281 ("Merge pull request #1217 from > > > + riscv/zawrs") of riscv-isa-manual. > > > > This part is fine... > > > > > > > Linux assumes that WRS.NTO will > > > + either always eventually terminate the stall due to the reservation > > > + set becoming invalid, implementation-specific other reasons, or > > > + because a higher privilege level has configured it to cause an > > > + illegal instruction exception after an implementation-specific > > > + bounded time limit. > > > > ...but I don't like this bit. The binding should just describe what the > > property means for the hardware, not discuss specifics about a > > particular OS. > > > > And with my dt-bindings hat off and my kernel hat on, I think that if we > > want to have more specific requirements than the extension provides we > > either need to a) document that zawrs means that it will always > > terminate or b) additionally document a "zawrs-always-terminates" that > > has that meaning and look for it to enable the behaviour. > > IIUC, the text above mostly just needs to remove 'Linux assumes' in order > to provide what we want for (a)? I'm not sure about (b). If Zawrs is > unusable as is, then we should probably just go back to the specs and get > a new standard extension name for a new version which includes the changes > we need. An (official) new name for the behaviour that you actually want, especially if the patchset sent the other day does not have the more stringent requirement (I won't even pretend to understand Zawrs well enough to know whether it does or not), sounds like the ideal outcome. That way you're also sorted on the ACPI side.
On Fri, Apr 19, 2024 at 04:19:52PM +0100, Conor Dooley wrote: > On Fri, Apr 19, 2024 at 05:16:05PM +0200, Andrew Jones wrote: > > On Fri, Apr 19, 2024 at 03:45:46PM +0100, Conor Dooley wrote: > > > On Fri, Apr 19, 2024 at 03:53:24PM +0200, Andrew Jones wrote: > > > > Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension > > > > which was ratified in commit 98918c844281 of riscv-isa-manual. > > > > > > > > Signed-off-by: Andrew Jones <ajones@ventanamicro.com> > > > > --- > > > > .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ > > > > 1 file changed, 12 insertions(+) > > > > > > > > diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > index 468c646247aa..584da2f539e5 100644 > > > > --- a/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > @@ -177,6 +177,18 @@ properties: > > > > is supported as ratified at commit 5059e0ca641c ("update to > > > > ratified") of the riscv-zacas. > > > > > > > > + - const: zawrs > > > > + description: | > > > > + The Zawrs extension for entering a low-power state or for trapping > > > > + to a hypervisor while waiting on a store to a memory location, as > > > > + ratified in commit 98918c844281 ("Merge pull request #1217 from > > > > + riscv/zawrs") of riscv-isa-manual. > > > > > > This part is fine... > > > > > > > > > > Linux assumes that WRS.NTO will > > > > + either always eventually terminate the stall due to the reservation > > > > + set becoming invalid, implementation-specific other reasons, or > > > > + because a higher privilege level has configured it to cause an > > > > + illegal instruction exception after an implementation-specific > > > > + bounded time limit. > > > > > > ...but I don't like this bit. The binding should just describe what the > > > property means for the hardware, not discuss specifics about a > > > particular OS. > > > > > > And with my dt-bindings hat off and my kernel hat on, I think that if we > > > want to have more specific requirements than the extension provides we > > > either need to a) document that zawrs means that it will always > > > terminate or b) additionally document a "zawrs-always-terminates" that > > > has that meaning and look for it to enable the behaviour. > > > > IIUC, the text above mostly just needs to remove 'Linux assumes' in order > > to provide what we want for (a)? I'm not sure about (b). If Zawrs is > > unusable as is, then we should probably just go back to the specs and get > > a new standard extension name for a new version which includes the changes > > we need. > > An (official) new name for the behaviour that you actually want, especially > if the patchset sent the other day does not have the more stringent > requirement (I won't even pretend to understand Zawrs well enough to know > whether it does or not), sounds like the ideal outcome. That way you're > also sorted on the ACPI side. What would be the purpose of a vendor implementing WRS.NTO (and putting it in the DT) that never terminates? The spec says "Then a subsequent WRS.NTO instruction would cause the hart to temporarily stall execution in a low- power state until a store occurs to the reservation set or an interrupt is observed." Why is this wording for WRS.NTO not sufficient to assume that an implementation of this instruction would eventually terminate? - Charlie
On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > On Fri, Apr 19, 2024 at 04:19:52PM +0100, Conor Dooley wrote: > > On Fri, Apr 19, 2024 at 05:16:05PM +0200, Andrew Jones wrote: > > > On Fri, Apr 19, 2024 at 03:45:46PM +0100, Conor Dooley wrote: > > > > On Fri, Apr 19, 2024 at 03:53:24PM +0200, Andrew Jones wrote: > > > > > Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension > > > > > which was ratified in commit 98918c844281 of riscv-isa-manual. > > > > > > > > > > Signed-off-by: Andrew Jones <ajones@ventanamicro.com> > > > > > --- > > > > > .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ > > > > > 1 file changed, 12 insertions(+) > > > > > > > > > > diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > > index 468c646247aa..584da2f539e5 100644 > > > > > --- a/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > > +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > > @@ -177,6 +177,18 @@ properties: > > > > > is supported as ratified at commit 5059e0ca641c ("update to > > > > > ratified") of the riscv-zacas. > > > > > > > > > > + - const: zawrs > > > > > + description: | > > > > > + The Zawrs extension for entering a low-power state or for trapping > > > > > + to a hypervisor while waiting on a store to a memory location, as > > > > > + ratified in commit 98918c844281 ("Merge pull request #1217 from > > > > > + riscv/zawrs") of riscv-isa-manual. > > > > > > > > This part is fine... > > > > > > > > > > > > > Linux assumes that WRS.NTO will > > > > > + either always eventually terminate the stall due to the reservation > > > > > + set becoming invalid, implementation-specific other reasons, or > > > > > + because a higher privilege level has configured it to cause an > > > > > + illegal instruction exception after an implementation-specific > > > > > + bounded time limit. > > > > > > > > ...but I don't like this bit. The binding should just describe what the > > > > property means for the hardware, not discuss specifics about a > > > > particular OS. > > > > > > > > And with my dt-bindings hat off and my kernel hat on, I think that if we > > > > want to have more specific requirements than the extension provides we > > > > either need to a) document that zawrs means that it will always > > > > terminate or b) additionally document a "zawrs-always-terminates" that > > > > has that meaning and look for it to enable the behaviour. > > > > > > IIUC, the text above mostly just needs to remove 'Linux assumes' in order > > > to provide what we want for (a)? I'm not sure about (b). If Zawrs is > > > unusable as is, then we should probably just go back to the specs and get > > > a new standard extension name for a new version which includes the changes > > > we need. > > > > An (official) new name for the behaviour that you actually want, especially > > if the patchset sent the other day does not have the more stringent > > requirement (I won't even pretend to understand Zawrs well enough to know > > whether it does or not), sounds like the ideal outcome. That way you're > > also sorted on the ACPI side. > > What would be the purpose of a vendor implementing WRS.NTO (and putting > it in the DT) that never terminates? The spec says "Then a subsequent > WRS.NTO instruction would cause the hart to temporarily stall execution > in a low- power state until a store occurs to the reservation set or an > interrupt is observed." Why is this wording for WRS.NTO not sufficient > to assume that an implementation of this instruction would eventually > terminate? > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This means we may not expect VAL ever to be written, which rules out "until a store occurs". As for "an interrupt is observed", we don't know which one to expect to arrive within a "reasonable" amount of time. We need to know which one(s), since, while wrs.nto will terminate even when interrupts are globally disabled, we still need to have the interrupt(s) we expect to be locally enabled. And, the interrupts should arrive in a "reasonable" amount of time since we want to poll anything_we_want() at a "reasonable" frequency. So, we need firmware to promise to enable exceptions if there aren't any such interrupts. Or, we could require hardware descriptions to identify which interrupt(s) would be good to have enabled before calling wrs.nto. Maybe there's already some way to describe something like that? Thanks, drew
On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > > On Fri, Apr 19, 2024 at 04:19:52PM +0100, Conor Dooley wrote: > > > On Fri, Apr 19, 2024 at 05:16:05PM +0200, Andrew Jones wrote: > > > > On Fri, Apr 19, 2024 at 03:45:46PM +0100, Conor Dooley wrote: > > > > > On Fri, Apr 19, 2024 at 03:53:24PM +0200, Andrew Jones wrote: > > > > > > Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension > > > > > > which was ratified in commit 98918c844281 of riscv-isa-manual. > > > > > > > > > > > > Signed-off-by: Andrew Jones <ajones@ventanamicro.com> > > > > > > --- > > > > > > .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ > > > > > > 1 file changed, 12 insertions(+) > > > > > > > > > > > > diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > > > index 468c646247aa..584da2f539e5 100644 > > > > > > --- a/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > > > +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml > > > > > > @@ -177,6 +177,18 @@ properties: > > > > > > is supported as ratified at commit 5059e0ca641c ("update to > > > > > > ratified") of the riscv-zacas. > > > > > > > > > > > > + - const: zawrs > > > > > > + description: | > > > > > > + The Zawrs extension for entering a low-power state or for trapping > > > > > > + to a hypervisor while waiting on a store to a memory location, as > > > > > > + ratified in commit 98918c844281 ("Merge pull request #1217 from > > > > > > + riscv/zawrs") of riscv-isa-manual. > > > > > > > > > > This part is fine... > > > > > > > > > > > > > > > > Linux assumes that WRS.NTO will > > > > > > + either always eventually terminate the stall due to the reservation > > > > > > + set becoming invalid, implementation-specific other reasons, or > > > > > > + because a higher privilege level has configured it to cause an > > > > > > + illegal instruction exception after an implementation-specific > > > > > > + bounded time limit. > > > > > > > > > > ...but I don't like this bit. The binding should just describe what the > > > > > property means for the hardware, not discuss specifics about a > > > > > particular OS. > > > > > > > > > > And with my dt-bindings hat off and my kernel hat on, I think that if we > > > > > want to have more specific requirements than the extension provides we > > > > > either need to a) document that zawrs means that it will always > > > > > terminate or b) additionally document a "zawrs-always-terminates" that > > > > > has that meaning and look for it to enable the behaviour. > > > > > > > > IIUC, the text above mostly just needs to remove 'Linux assumes' in order > > > > to provide what we want for (a)? I'm not sure about (b). If Zawrs is > > > > unusable as is, then we should probably just go back to the specs and get > > > > a new standard extension name for a new version which includes the changes > > > > we need. > > > > > > An (official) new name for the behaviour that you actually want, especially > > > if the patchset sent the other day does not have the more stringent > > > requirement (I won't even pretend to understand Zawrs well enough to know > > > whether it does or not), sounds like the ideal outcome. That way you're > > > also sorted on the ACPI side. > > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > it in the DT) that never terminates? The spec says "Then a subsequent > > WRS.NTO instruction would cause the hart to temporarily stall execution > > in a low- power state until a store occurs to the reservation set or an > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > to assume that an implementation of this instruction would eventually > > terminate? > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > means we may not expect VAL ever to be written, which rules out "until a > store occurs". As for "an interrupt is observed", we don't know which one > to expect to arrive within a "reasonable" amount of time. We need to know > which one(s), since, while wrs.nto will terminate even when interrupts are > globally disabled, we still need to have the interrupt(s) we expect to be > locally enabled. And, the interrupts should arrive in a "reasonable" > amount of time since we want to poll anything_we_want() at a "reasonable" > frequency. > > So, we need firmware to promise to enable exceptions if there aren't any > such interrupts. Or, we could require hardware descriptions to identify > which interrupt(s) would be good to have enabled before calling wrs.nto. > Maybe there's already some way to describe something like that? > > Thanks, > drew Ahh okay I am caught up now. So the wording we are looking at in the spec is: "When executing in VS or VU mode, if the VTW bit is set in hstatus, the TW bit in mstatus is clear, and the WRS.NTO does not complete within an implementation-specific bounded time limit, the WRS.NTO instruction will cause a virtual instruction exception." With the concern being that it is possible for "implementation-specific bounded time limit" to be infinite/never times out, and the kernel enters a WRS where the reservation set is not required to be invalidated for the condition we are waiting on to become true. An option here would be to enforce in the spec that this time limit is finite. If the original intention of the spec was to have it be finite, then this would not be an issue. If the intention was to allow no time limit, then this would probably have to be a new extension. We are also able to change the kernel to not allow these conditions that would break this interpretation of WRS. I found three instances in the kernel that contain a condition that is not dependent on the wrs reservation. 1. # queued_spin_lock_slowpath() in kernel/locking/qspinlock.c val = atomic_cond_read_relaxed(&lock->val, (VAL != _Q_PENDING_VAL) || !cnt--); The first condition will only become true if lock->val changes which should invalidate the reservation. !cnt-- on the otherhand is a counter of the number of loops that happen under-the-hood in atomic_cond_read_relaxed. This seems like an abuse of the function and could be factored out into a new bounded-iteration cond_read macro. 2. # osq_lock() in kernel/locking/osq_lock.c if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || vcpu_is_preempted(node_cpu(node->prev)))) VAL is the first condition and won't be a problem here since changes to it will cause the reservation to become invalid. arm64 has hard-coded vcpu_is_preempted to be false for the same exact reason that riscv would want to (the wait wouldn't be woken up). There is a comment that points this out in arch/arm64/include/asm/spinlock.h. riscv currently uses the default implementation which returns false. need_resched() should not be a problem since this condition only changes when the hart recieves an IPI, so as long as the hart is able to receive an IPI while in WRS it will be fine. 3. # __arm_smmu_cmdq_poll_until_msi() in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c smp_cond_load_relaxed(cmd, !VAL || (ret = queue_poll(&qp))); arm driver, not relevant. The only case that would cause a problem in the current implementation of the kernel would be queued_spin_lock_slowpath() with the cnt check. We are able to either change this definition, change the spec, or leave it up to the vendor who would be hit by this issue to change it. - Charlie
On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote: > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: ... > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > > it in the DT) that never terminates? The spec says "Then a subsequent > > > WRS.NTO instruction would cause the hart to temporarily stall execution > > > in a low- power state until a store occurs to the reservation set or an > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > > to assume that an implementation of this instruction would eventually > > > terminate? > > > > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > > means we may not expect VAL ever to be written, which rules out "until a > > store occurs". As for "an interrupt is observed", we don't know which one > > to expect to arrive within a "reasonable" amount of time. We need to know > > which one(s), since, while wrs.nto will terminate even when interrupts are > > globally disabled, we still need to have the interrupt(s) we expect to be > > locally enabled. And, the interrupts should arrive in a "reasonable" > > amount of time since we want to poll anything_we_want() at a "reasonable" > > frequency. > > > > So, we need firmware to promise to enable exceptions if there aren't any > > such interrupts. Or, we could require hardware descriptions to identify > > which interrupt(s) would be good to have enabled before calling wrs.nto. > > Maybe there's already some way to describe something like that? > > > > Thanks, > > drew > > Ahh okay I am caught up now. So the wording we are looking at in the > spec is: > > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the > TW bit in mstatus is clear, and the WRS.NTO does not complete within an > implementation-specific bounded time limit, the WRS.NTO instruction will > cause a virtual instruction exception." That's what the hypervisor should promise to do when there's no other guarantee of wrs.nto terminating (but the hypervisor likely wants to anyway since it wants guests to trap on wrs.nto in order to potentially schedule the lock holding VCPU). The firmware of the host should likewise promise to set mstatus.TW when there's no guarantee of wrs.nto terminating, but that's likely _not_ something it normally would want to do, so hopefully there will always be implementation-specific "other reasons" which guarantee termination. > > With the concern being that it is possible for "implementation-specific > bounded time limit" to be infinite/never times out, The implementation-defined short timeout cannot be infinite, but it only applies to wrs.sto. While using wrs.sto would relieve the concern, it cannot be configured to raise exceptions, which means it's not useful in guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we need a paravirt channel which allows an "enlightened" guest to determine that it is a guest and that the hypervisor has configured wrs.nto to trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto. But, adding paravirt stuff should be avoided whenever possible since it adds complexity we'd rather not maintain. > and the kernel > enters a WRS where the reservation set is not required to be invalidated > for the condition we are waiting on to become true. > > An option here would be to enforce in the spec that this time limit is > finite. If the original intention of the spec was to have it be finite, > then this would not be an issue. If the intention was to allow no time > limit, then this would probably have to be a new extension. wrs.nto has been specified to never timeout. wrs.sto has been specified to never raise exceptions. If we had an instruction which would timeout when mstatus.TW/hstatus.VTW is clear and raise exceptions when set, then that's the one we'd choose. > > We are also able to change the kernel to not allow these conditions that > would break this interpretation of WRS. I found three instances in the > kernel that contain a condition that is not dependent on the wrs > reservation. > > 1. > # queued_spin_lock_slowpath() in kernel/locking/qspinlock.c > val = atomic_cond_read_relaxed(&lock->val, > (VAL != _Q_PENDING_VAL) || !cnt--); > > The first condition will only become true if lock->val changes which > should invalidate the reservation. !cnt-- on the otherhand is a counter > of the number of loops that happen under-the-hood in > atomic_cond_read_relaxed. This seems like an abuse of the function and > could be factored out into a new bounded-iteration cond_read macro. > > 2. > # osq_lock() in kernel/locking/osq_lock.c > if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || > vcpu_is_preempted(node_cpu(node->prev)))) > > VAL is the first condition and won't be a problem here since changes to > it will cause the reservation to become invalid. arm64 has hard-coded > vcpu_is_preempted to be false for the same exact reason that riscv would > want to (the wait wouldn't be woken up). There is a comment that > points this out in arch/arm64/include/asm/spinlock.h. riscv currently > uses the default implementation which returns false. The operative word is 'currently'. I plan to eventually get riscv's vcpu_is_preempted() working since we already laid the groundwork by adding a preempted flag to the SBI STA struct. > > need_resched() should not be a problem since this condition only changes > when the hart recieves an IPI, so as long as the hart is able to receive > an IPI while in WRS it will be fine. > > 3. > # __arm_smmu_cmdq_poll_until_msi() in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > smp_cond_load_relaxed(cmd, !VAL || (ret = queue_poll(&qp))); > > arm driver, not relevant. > > > > The only case that would cause a problem in the current implementation > of the kernel would be queued_spin_lock_slowpath() with the cnt check. > We are able to either change this definition, change the spec, or leave > it up to the vendor who would be hit by this issue to change it. We could attempt to document restrictions on the condition given to smp_cond_load_relaxed() and then change the callers to honor those restrictions, but that doesn't sound too easy. How will we remove vcpu_is_preempted() on x86? We should probably start the process of a new extension which has the hybrid wrs.sto/nto instruction, but I'm reluctant to, since, in practice, the current wrs.nto is probably fine. I don't really expect there to be implementations that never terminate, even though I'd rather we document that it's _required_ wrs.nto terminates, or that exceptions be raised. Maybe I'm attempting to document it in the wrong place though. Maybe this is more of a Documentation/arch/riscv/boot.rst type of thing. Thanks, drew
On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote: > Maybe I'm attempting to document it in the wrong place though. Maybe this > is more of a Documentation/arch/riscv/boot.rst type of thing. It might be good to document it there also, but I'd like to avoid being unable to rely on what firmware tells us is supported because we have a stricter requirement.
On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote: > On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote: > > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > ... > > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > > > it in the DT) that never terminates? The spec says "Then a subsequent > > > > WRS.NTO instruction would cause the hart to temporarily stall execution > > > > in a low- power state until a store occurs to the reservation set or an > > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > > > to assume that an implementation of this instruction would eventually > > > > terminate? > > > > > > > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > > > means we may not expect VAL ever to be written, which rules out "until a > > > store occurs". As for "an interrupt is observed", we don't know which one > > > to expect to arrive within a "reasonable" amount of time. We need to know > > > which one(s), since, while wrs.nto will terminate even when interrupts are > > > globally disabled, we still need to have the interrupt(s) we expect to be > > > locally enabled. And, the interrupts should arrive in a "reasonable" > > > amount of time since we want to poll anything_we_want() at a "reasonable" > > > frequency. > > > > > > So, we need firmware to promise to enable exceptions if there aren't any > > > such interrupts. Or, we could require hardware descriptions to identify > > > which interrupt(s) would be good to have enabled before calling wrs.nto. > > > Maybe there's already some way to describe something like that? > > > > > > Thanks, > > > drew > > > > Ahh okay I am caught up now. So the wording we are looking at in the > > spec is: > > > > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the > > TW bit in mstatus is clear, and the WRS.NTO does not complete within an > > implementation-specific bounded time limit, the WRS.NTO instruction will > > cause a virtual instruction exception." > > That's what the hypervisor should promise to do when there's no other > guarantee of wrs.nto terminating (but the hypervisor likely wants to > anyway since it wants guests to trap on wrs.nto in order to potentially > schedule the lock holding VCPU). The firmware of the host should likewise > promise to set mstatus.TW when there's no guarantee of wrs.nto > terminating, but that's likely _not_ something it normally would want to > do, so hopefully there will always be implementation-specific "other > reasons" which guarantee termination. > > > > > With the concern being that it is possible for "implementation-specific > > bounded time limit" to be infinite/never times out, > > The implementation-defined short timeout cannot be infinite, but it only > applies to wrs.sto. While using wrs.sto would relieve the concern, it > cannot be configured to raise exceptions, which means it's not useful in > guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we > need a paravirt channel which allows an "enlightened" guest to determine > that it is a guest and that the hypervisor has configured wrs.nto to > trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto. > But, adding paravirt stuff should be avoided whenever possible since it > adds complexity we'd rather not maintain. > That still wouldn't solve this issue, because the wrs.nto guest may still never wakeup in the implementation-specific way? > > and the kernel > > enters a WRS where the reservation set is not required to be invalidated > > for the condition we are waiting on to become true. > > > > An option here would be to enforce in the spec that this time limit is > > finite. If the original intention of the spec was to have it be finite, > > then this would not be an issue. If the intention was to allow no time > > limit, then this would probably have to be a new extension. > > wrs.nto has been specified to never timeout. > wrs.sto has been specified to never raise exceptions. > > If we had an instruction which would timeout when mstatus.TW/hstatus.VTW > is clear and raise exceptions when set, then that's the one we'd choose. > Yes, this does seem like the ideal situtation. > > > > We are also able to change the kernel to not allow these conditions that > > would break this interpretation of WRS. I found three instances in the > > kernel that contain a condition that is not dependent on the wrs > > reservation. > > > > 1. > > # queued_spin_lock_slowpath() in kernel/locking/qspinlock.c > > val = atomic_cond_read_relaxed(&lock->val, > > (VAL != _Q_PENDING_VAL) || !cnt--); > > > > The first condition will only become true if lock->val changes which > > should invalidate the reservation. !cnt-- on the otherhand is a counter > > of the number of loops that happen under-the-hood in > > atomic_cond_read_relaxed. This seems like an abuse of the function and > > could be factored out into a new bounded-iteration cond_read macro. > > > > 2. > > # osq_lock() in kernel/locking/osq_lock.c > > if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || > > vcpu_is_preempted(node_cpu(node->prev)))) > > > > VAL is the first condition and won't be a problem here since changes to > > it will cause the reservation to become invalid. arm64 has hard-coded > > vcpu_is_preempted to be false for the same exact reason that riscv would > > want to (the wait wouldn't be woken up). There is a comment that > > points this out in arch/arm64/include/asm/spinlock.h. riscv currently > > uses the default implementation which returns false. > > The operative word is 'currently'. I plan to eventually get riscv's > vcpu_is_preempted() working since we already laid the groundwork by > adding a preempted flag to the SBI STA struct. > > > > > need_resched() should not be a problem since this condition only changes > > when the hart recieves an IPI, so as long as the hart is able to receive > > an IPI while in WRS it will be fine. > > > > 3. > > # __arm_smmu_cmdq_poll_until_msi() in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > smp_cond_load_relaxed(cmd, !VAL || (ret = queue_poll(&qp))); > > > > arm driver, not relevant. > > > > > > > > The only case that would cause a problem in the current implementation > > of the kernel would be queued_spin_lock_slowpath() with the cnt check. > > We are able to either change this definition, change the spec, or leave > > it up to the vendor who would be hit by this issue to change it. > > We could attempt to document restrictions on the condition given to > smp_cond_load_relaxed() and then change the callers to honor those > restrictions, but that doesn't sound too easy. How will we remove > vcpu_is_preempted() on x86? The solution here seems like it would be to not use wrs for riscv in this case. We really would need an alternate extension that allows wrs in a guest to be guaranteed to trap into the host at some point. > > We should probably start the process of a new extension which has the > hybrid wrs.sto/nto instruction, but I'm reluctant to, since, in practice, > the current wrs.nto is probably fine. I don't really expect there to be > implementations that never terminate, even though I'd rather we document > that it's _required_ wrs.nto terminates, or that exceptions be raised. > Maybe I'm attempting to document it in the wrong place though. Maybe this > is more of a Documentation/arch/riscv/boot.rst type of thing. > wrs.nto is most likely sufficient. A new extension will take a long time. We could go ahead with wrs.nto, propose the extension, and when the extension is ready switch over to it. In the meantime have this behavior documented. > Thanks, > drew - Charlie
On Tue, Apr 23, 2024 at 02:00:53PM -0400, Charlie Jenkins wrote: > On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote: > > On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote: > > > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > > > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > > ... > > > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > > > > it in the DT) that never terminates? The spec says "Then a subsequent > > > > > WRS.NTO instruction would cause the hart to temporarily stall execution > > > > > in a low- power state until a store occurs to the reservation set or an > > > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > > > > to assume that an implementation of this instruction would eventually > > > > > terminate? > > > > > > > > > > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > > > > means we may not expect VAL ever to be written, which rules out "until a > > > > store occurs". As for "an interrupt is observed", we don't know which one > > > > to expect to arrive within a "reasonable" amount of time. We need to know > > > > which one(s), since, while wrs.nto will terminate even when interrupts are > > > > globally disabled, we still need to have the interrupt(s) we expect to be > > > > locally enabled. And, the interrupts should arrive in a "reasonable" > > > > amount of time since we want to poll anything_we_want() at a "reasonable" > > > > frequency. > > > > > > > > So, we need firmware to promise to enable exceptions if there aren't any > > > > such interrupts. Or, we could require hardware descriptions to identify > > > > which interrupt(s) would be good to have enabled before calling wrs.nto. > > > > Maybe there's already some way to describe something like that? > > > > > > > > Thanks, > > > > drew > > > > > > Ahh okay I am caught up now. So the wording we are looking at in the > > > spec is: > > > > > > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the > > > TW bit in mstatus is clear, and the WRS.NTO does not complete within an > > > implementation-specific bounded time limit, the WRS.NTO instruction will > > > cause a virtual instruction exception." > > > > That's what the hypervisor should promise to do when there's no other > > guarantee of wrs.nto terminating (but the hypervisor likely wants to > > anyway since it wants guests to trap on wrs.nto in order to potentially > > schedule the lock holding VCPU). The firmware of the host should likewise > > promise to set mstatus.TW when there's no guarantee of wrs.nto > > terminating, but that's likely _not_ something it normally would want to > > do, so hopefully there will always be implementation-specific "other > > reasons" which guarantee termination. > > > > > > > > With the concern being that it is possible for "implementation-specific > > > bounded time limit" to be infinite/never times out, > > > > The implementation-defined short timeout cannot be infinite, but it only > > applies to wrs.sto. While using wrs.sto would relieve the concern, it > > cannot be configured to raise exceptions, which means it's not useful in > > guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we > > need a paravirt channel which allows an "enlightened" guest to determine > > that it is a guest and that the hypervisor has configured wrs.nto to > > trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto. > > But, adding paravirt stuff should be avoided whenever possible since it > > adds complexity we'd rather not maintain. > > > > That still wouldn't solve this issue, because the wrs.nto guest may still > never wakeup in the implementation-specific way? > > > > and the kernel > > > enters a WRS where the reservation set is not required to be invalidated > > > for the condition we are waiting on to become true. > > > > > > An option here would be to enforce in the spec that this time limit is > > > finite. If the original intention of the spec was to have it be finite, > > > then this would not be an issue. If the intention was to allow no time > > > limit, then this would probably have to be a new extension. > > > > wrs.nto has been specified to never timeout. > > wrs.sto has been specified to never raise exceptions. > > > > If we had an instruction which would timeout when mstatus.TW/hstatus.VTW > > is clear and raise exceptions when set, then that's the one we'd choose. > > > > Yes, this does seem like the ideal situtation. > > > > > > > We are also able to change the kernel to not allow these conditions that > > > would break this interpretation of WRS. I found three instances in the > > > kernel that contain a condition that is not dependent on the wrs > > > reservation. > > > > > > 1. > > > # queued_spin_lock_slowpath() in kernel/locking/qspinlock.c > > > val = atomic_cond_read_relaxed(&lock->val, > > > (VAL != _Q_PENDING_VAL) || !cnt--); > > > > > > The first condition will only become true if lock->val changes which > > > should invalidate the reservation. !cnt-- on the otherhand is a counter > > > of the number of loops that happen under-the-hood in > > > atomic_cond_read_relaxed. This seems like an abuse of the function and > > > could be factored out into a new bounded-iteration cond_read macro. > > > > > > 2. > > > # osq_lock() in kernel/locking/osq_lock.c > > > if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || > > > vcpu_is_preempted(node_cpu(node->prev)))) > > > > > > VAL is the first condition and won't be a problem here since changes to > > > it will cause the reservation to become invalid. arm64 has hard-coded > > > vcpu_is_preempted to be false for the same exact reason that riscv would > > > want to (the wait wouldn't be woken up). There is a comment that > > > points this out in arch/arm64/include/asm/spinlock.h. riscv currently > > > uses the default implementation which returns false. > > > > The operative word is 'currently'. I plan to eventually get riscv's > > vcpu_is_preempted() working since we already laid the groundwork by > > adding a preempted flag to the SBI STA struct. > > > > > > > > need_resched() should not be a problem since this condition only changes > > > when the hart recieves an IPI, so as long as the hart is able to receive > > > an IPI while in WRS it will be fine. > > > > > > 3. > > > # __arm_smmu_cmdq_poll_until_msi() in drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > > smp_cond_load_relaxed(cmd, !VAL || (ret = queue_poll(&qp))); > > > > > > arm driver, not relevant. > > > > > > > > > > > > The only case that would cause a problem in the current implementation > > > of the kernel would be queued_spin_lock_slowpath() with the cnt check. > > > We are able to either change this definition, change the spec, or leave > > > it up to the vendor who would be hit by this issue to change it. > > > > We could attempt to document restrictions on the condition given to > > smp_cond_load_relaxed() and then change the callers to honor those > > restrictions, but that doesn't sound too easy. How will we remove > > vcpu_is_preempted() on x86? > > The solution here seems like it would be to not use wrs for riscv in > this case. We really would need an alternate extension that allows wrs > in a guest to be guaranteed to trap into the host at some point. Thinking about this a bit more, this is a performance penalty and not a correctness issue. This line is an optimization that allows the lock holder to jump the queue if the holder directly in front is a preempted vcpu. Eventually the vcpu will be scheduled again and give up the lock. So an implementation of WRS.NTO that does not have the "implementation-specific bounded time limit" that the spec calls out for WRS.NTO to raise a virtual instruction exception, would still function, just slower. I am not sure where the line we would draw here. Using WRS here would be very beneficial for most implementations, but will not be optimally efficient on some implementations of WRS in virtualization. We could make a note that for optimal efficiency Linux recommends that a virtual instruction exception is eventually thrown. We also wouldn't need to mess with the cnt variable since that can only increase during a loop by the thread that is doing the loop. It is an optimization to break out early. Again this is purely performance. If the implementation never breaks out without the lock variable being changed, it will just wait until the variable is changed rather than breaking out early. - Charlie > > > > > We should probably start the process of a new extension which has the > > hybrid wrs.sto/nto instruction, but I'm reluctant to, since, in practice, > > the current wrs.nto is probably fine. I don't really expect there to be > > implementations that never terminate, even though I'd rather we document > > that it's _required_ wrs.nto terminates, or that exceptions be raised. > > Maybe I'm attempting to document it in the wrong place though. Maybe this > > is more of a Documentation/arch/riscv/boot.rst type of thing. > > > > wrs.nto is most likely sufficient. A new extension will take a long > time. We could go ahead with wrs.nto, propose the extension, and when > the extension is ready switch over to it. In the meantime have this > behavior documented. > > > Thanks, > > drew > > - Charlie >
On Tue, Apr 23, 2024 at 03:42:47PM -0400, Charlie Jenkins wrote: > On Tue, Apr 23, 2024 at 02:00:53PM -0400, Charlie Jenkins wrote: > > On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote: > > > On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote: > > > > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > > > > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > > > ... > > > > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > > > > > it in the DT) that never terminates? The spec says "Then a subsequent > > > > > > WRS.NTO instruction would cause the hart to temporarily stall execution > > > > > > in a low- power state until a store occurs to the reservation set or an > > > > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > > > > > to assume that an implementation of this instruction would eventually > > > > > > terminate? > > > > > > > > > > > > > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > > > > > means we may not expect VAL ever to be written, which rules out "until a > > > > > store occurs". As for "an interrupt is observed", we don't know which one > > > > > to expect to arrive within a "reasonable" amount of time. We need to know > > > > > which one(s), since, while wrs.nto will terminate even when interrupts are > > > > > globally disabled, we still need to have the interrupt(s) we expect to be > > > > > locally enabled. And, the interrupts should arrive in a "reasonable" > > > > > amount of time since we want to poll anything_we_want() at a "reasonable" > > > > > frequency. > > > > > > > > > > So, we need firmware to promise to enable exceptions if there aren't any > > > > > such interrupts. Or, we could require hardware descriptions to identify > > > > > which interrupt(s) would be good to have enabled before calling wrs.nto. > > > > > Maybe there's already some way to describe something like that? > > > > > > > > > > Thanks, > > > > > drew > > > > > > > > Ahh okay I am caught up now. So the wording we are looking at in the > > > > spec is: > > > > > > > > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the > > > > TW bit in mstatus is clear, and the WRS.NTO does not complete within an > > > > implementation-specific bounded time limit, the WRS.NTO instruction will > > > > cause a virtual instruction exception." > > > > > > That's what the hypervisor should promise to do when there's no other > > > guarantee of wrs.nto terminating (but the hypervisor likely wants to > > > anyway since it wants guests to trap on wrs.nto in order to potentially > > > schedule the lock holding VCPU). The firmware of the host should likewise > > > promise to set mstatus.TW when there's no guarantee of wrs.nto > > > terminating, but that's likely _not_ something it normally would want to > > > do, so hopefully there will always be implementation-specific "other > > > reasons" which guarantee termination. > > > > > > > > > > > With the concern being that it is possible for "implementation-specific > > > > bounded time limit" to be infinite/never times out, > > > > > > The implementation-defined short timeout cannot be infinite, but it only > > > applies to wrs.sto. While using wrs.sto would relieve the concern, it > > > cannot be configured to raise exceptions, which means it's not useful in > > > guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we > > > need a paravirt channel which allows an "enlightened" guest to determine > > > that it is a guest and that the hypervisor has configured wrs.nto to > > > trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto. > > > But, adding paravirt stuff should be avoided whenever possible since it > > > adds complexity we'd rather not maintain. > > > > > > > That still wouldn't solve this issue, because the wrs.nto guest may still > > never wakeup in the implementation-specific way? The paravirt approach does solve it, because wrs.nto is specified to raise exceptions after an implementation-specific bounded time limit when the hypervisor sets hstatus.VTW. > > Thinking about this a bit more, this is a performance penalty and not a > correctness issue. It's incorrect to have a design that is likely to result in bad performance. > This line is an optimization that allows the lock > holder to jump the queue if the holder directly in front is a preempted > vcpu. Eventually the vcpu will be scheduled again and give up the lock. > So an implementation of WRS.NTO that does not have the > "implementation-specific bounded time limit" that the spec calls out for > WRS.NTO to raise a virtual instruction exception, would still function, > just slower. The problem isn't specific to virtualization. The problem is using wrs.nto when it has not been configured to raise exceptions and there are not any other guaranteed termination events (other than the reservation set becoming invalid). While the paravirt solution is virtualization specific, it works, because we would then only use wrs.nto when we know we can, and otherwise use wrs.sto. But, as I said, I'd rather not have a paravirt solution. Thanks, drew
On Wed, Apr 24, 2024 at 9:34 AM Andrew Jones <ajones@ventanamicro.com> wrote: > > On Tue, Apr 23, 2024 at 03:42:47PM -0400, Charlie Jenkins wrote: > > On Tue, Apr 23, 2024 at 02:00:53PM -0400, Charlie Jenkins wrote: > > > On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote: > > > > On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote: > > > > > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > > > > > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > > > > ... > > > > > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > > > > > > it in the DT) that never terminates? The spec says "Then a subsequent > > > > > > > WRS.NTO instruction would cause the hart to temporarily stall execution > > > > > > > in a low- power state until a store occurs to the reservation set or an > > > > > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > > > > > > to assume that an implementation of this instruction would eventually > > > > > > > terminate? > > > > > > > > > > > > > > > > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > > > > > > means we may not expect VAL ever to be written, which rules out "until a > > > > > > store occurs". As for "an interrupt is observed", we don't know which one > > > > > > to expect to arrive within a "reasonable" amount of time. We need to know > > > > > > which one(s), since, while wrs.nto will terminate even when interrupts are > > > > > > globally disabled, we still need to have the interrupt(s) we expect to be > > > > > > locally enabled. And, the interrupts should arrive in a "reasonable" > > > > > > amount of time since we want to poll anything_we_want() at a "reasonable" > > > > > > frequency. > > > > > > > > > > > > So, we need firmware to promise to enable exceptions if there aren't any > > > > > > such interrupts. Or, we could require hardware descriptions to identify > > > > > > which interrupt(s) would be good to have enabled before calling wrs.nto. > > > > > > Maybe there's already some way to describe something like that? > > > > > > > > > > > > Thanks, > > > > > > drew > > > > > > > > > > Ahh okay I am caught up now. So the wording we are looking at in the > > > > > spec is: > > > > > > > > > > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the > > > > > TW bit in mstatus is clear, and the WRS.NTO does not complete within an > > > > > implementation-specific bounded time limit, the WRS.NTO instruction will > > > > > cause a virtual instruction exception." > > > > > > > > That's what the hypervisor should promise to do when there's no other > > > > guarantee of wrs.nto terminating (but the hypervisor likely wants to > > > > anyway since it wants guests to trap on wrs.nto in order to potentially > > > > schedule the lock holding VCPU). The firmware of the host should likewise > > > > promise to set mstatus.TW when there's no guarantee of wrs.nto > > > > terminating, but that's likely _not_ something it normally would want to > > > > do, so hopefully there will always be implementation-specific "other > > > > reasons" which guarantee termination. > > > > > > > > > > > > > > With the concern being that it is possible for "implementation-specific > > > > > bounded time limit" to be infinite/never times out, > > > > > > > > The implementation-defined short timeout cannot be infinite, but it only > > > > applies to wrs.sto. While using wrs.sto would relieve the concern, it > > > > cannot be configured to raise exceptions, which means it's not useful in > > > > guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we > > > > need a paravirt channel which allows an "enlightened" guest to determine > > > > that it is a guest and that the hypervisor has configured wrs.nto to > > > > trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto. > > > > But, adding paravirt stuff should be avoided whenever possible since it > > > > adds complexity we'd rather not maintain. > > > > > > > > > > That still wouldn't solve this issue, because the wrs.nto guest may still > > > never wakeup in the implementation-specific way? > > The paravirt approach does solve it, because wrs.nto is specified to raise > exceptions after an implementation-specific bounded time limit when the > hypervisor sets hstatus.VTW. > > > > > Thinking about this a bit more, this is a performance penalty and not a > > correctness issue. > > It's incorrect to have a design that is likely to result in bad > performance. > > > This line is an optimization that allows the lock > > holder to jump the queue if the holder directly in front is a preempted > > vcpu. Eventually the vcpu will be scheduled again and give up the lock. > > So an implementation of WRS.NTO that does not have the > > "implementation-specific bounded time limit" that the spec calls out for > > WRS.NTO to raise a virtual instruction exception, would still function, > > just slower. > > The problem isn't specific to virtualization. The problem is using wrs.nto > when it has not been configured to raise exceptions and there are not any > other guaranteed termination events (other than the reservation set > becoming invalid). While the paravirt solution is virtualization specific, > it works, because we would then only use wrs.nto when we know we can, and > otherwise use wrs.sto. But, as I said, I'd rather not have a paravirt > solution. Andrew, it would be great if you could summarize this finding to the spec authors. Maybe a non-normative text added to the spec (that raises awareness for the issue and provides a guideline to avoid it) could reduce the risk of triggering the issue on real HW. This might be enough justification to use WRS.NTO. If WRS.NTO is considered as not reliable enough to wake up and therefore causing performance issues or CPU stalls if used for the spin lock optimization, it might be also reasonable to use WRS.STO instead. The cost of having too many wakeups seems much more acceptable than a stalled CPU. BR Christoph
On Wed, Apr 24, 2024 at 11:23:00AM +0200, Christoph Müllner wrote: > On Wed, Apr 24, 2024 at 9:34 AM Andrew Jones <ajones@ventanamicro.com> wrote: > > > > On Tue, Apr 23, 2024 at 03:42:47PM -0400, Charlie Jenkins wrote: > > > On Tue, Apr 23, 2024 at 02:00:53PM -0400, Charlie Jenkins wrote: > > > > On Tue, Apr 23, 2024 at 10:46:01AM +0200, Andrew Jones wrote: > > > > > On Mon, Apr 22, 2024 at 06:36:45PM -0400, Charlie Jenkins wrote: > > > > > > On Sun, Apr 21, 2024 at 12:20:03PM +0200, Andrew Jones wrote: > > > > > > > On Fri, Apr 19, 2024 at 12:40:01PM -0400, Charlie Jenkins wrote: > > > > > ... > > > > > > > > What would be the purpose of a vendor implementing WRS.NTO (and putting > > > > > > > > it in the DT) that never terminates? The spec says "Then a subsequent > > > > > > > > WRS.NTO instruction would cause the hart to temporarily stall execution > > > > > > > > in a low- power state until a store occurs to the reservation set or an > > > > > > > > interrupt is observed." Why is this wording for WRS.NTO not sufficient > > > > > > > > to assume that an implementation of this instruction would eventually > > > > > > > > terminate? > > > > > > > > > > > > > > > > > > > > > > We can invoke smp_cond_load_relaxed(addr, VAL || anything_we_want()). This > > > > > > > means we may not expect VAL ever to be written, which rules out "until a > > > > > > > store occurs". As for "an interrupt is observed", we don't know which one > > > > > > > to expect to arrive within a "reasonable" amount of time. We need to know > > > > > > > which one(s), since, while wrs.nto will terminate even when interrupts are > > > > > > > globally disabled, we still need to have the interrupt(s) we expect to be > > > > > > > locally enabled. And, the interrupts should arrive in a "reasonable" > > > > > > > amount of time since we want to poll anything_we_want() at a "reasonable" > > > > > > > frequency. > > > > > > > > > > > > > > So, we need firmware to promise to enable exceptions if there aren't any > > > > > > > such interrupts. Or, we could require hardware descriptions to identify > > > > > > > which interrupt(s) would be good to have enabled before calling wrs.nto. > > > > > > > Maybe there's already some way to describe something like that? > > > > > > > > > > > > > > Thanks, > > > > > > > drew > > > > > > > > > > > > Ahh okay I am caught up now. So the wording we are looking at in the > > > > > > spec is: > > > > > > > > > > > > "When executing in VS or VU mode, if the VTW bit is set in hstatus, the > > > > > > TW bit in mstatus is clear, and the WRS.NTO does not complete within an > > > > > > implementation-specific bounded time limit, the WRS.NTO instruction will > > > > > > cause a virtual instruction exception." > > > > > > > > > > That's what the hypervisor should promise to do when there's no other > > > > > guarantee of wrs.nto terminating (but the hypervisor likely wants to > > > > > anyway since it wants guests to trap on wrs.nto in order to potentially > > > > > schedule the lock holding VCPU). The firmware of the host should likewise > > > > > promise to set mstatus.TW when there's no guarantee of wrs.nto > > > > > terminating, but that's likely _not_ something it normally would want to > > > > > do, so hopefully there will always be implementation-specific "other > > > > > reasons" which guarantee termination. > > > > > > > > > > > > > > > > > With the concern being that it is possible for "implementation-specific > > > > > > bounded time limit" to be infinite/never times out, > > > > > > > > > > The implementation-defined short timeout cannot be infinite, but it only > > > > > applies to wrs.sto. While using wrs.sto would relieve the concern, it > > > > > cannot be configured to raise exceptions, which means it's not useful in > > > > > guests. If we want to use wrs.sto in hosts and wrs.nto in guests then we > > > > > need a paravirt channel which allows an "enlightened" guest to determine > > > > > that it is a guest and that the hypervisor has configured wrs.nto to > > > > > trap, which then indicates it's a good idea to patch wrs.sto to wrs.nto. > > > > > But, adding paravirt stuff should be avoided whenever possible since it > > > > > adds complexity we'd rather not maintain. > > > > > > > > > > > > > That still wouldn't solve this issue, because the wrs.nto guest may still > > > > never wakeup in the implementation-specific way? > > > > The paravirt approach does solve it, because wrs.nto is specified to raise > > exceptions after an implementation-specific bounded time limit when the > > hypervisor sets hstatus.VTW. > > > > > > > > Thinking about this a bit more, this is a performance penalty and not a > > > correctness issue. > > > > It's incorrect to have a design that is likely to result in bad > > performance. > > > > > This line is an optimization that allows the lock > > > holder to jump the queue if the holder directly in front is a preempted > > > vcpu. Eventually the vcpu will be scheduled again and give up the lock. > > > So an implementation of WRS.NTO that does not have the > > > "implementation-specific bounded time limit" that the spec calls out for > > > WRS.NTO to raise a virtual instruction exception, would still function, > > > just slower. > > > > The problem isn't specific to virtualization. The problem is using wrs.nto > > when it has not been configured to raise exceptions and there are not any > > other guaranteed termination events (other than the reservation set > > becoming invalid). While the paravirt solution is virtualization specific, > > it works, because we would then only use wrs.nto when we know we can, and > > otherwise use wrs.sto. But, as I said, I'd rather not have a paravirt > > solution. > > Andrew, it would be great if you could summarize this finding to the > spec authors. > Maybe a non-normative text added to the spec (that raises awareness > for the issue Sure, I'll write something up pointing out the concern with wrs.nto and post it to a few RVI mailing lists. > and provides a guideline to avoid it) could reduce the risk of triggering > the issue on real HW. This might be enough justification to use WRS.NTO. > > If WRS.NTO is considered as not reliable enough to wake up and therefore causing > performance issues or CPU stalls if used for the spin lock optimization, > it might be also reasonable to use WRS.STO instead. > The cost of having too many wakeups seems much more acceptable than > a stalled CPU. wrs.sto is reasonable to use in all cases, since too many wakeups isn't a concern. But, we can do better in the virtualization case with wrs.nto, where the hypervisor can get involved, so, to avoid paravirt stuff, we'd like to be able to always use wrs.nto. Thanks, drew
diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml index 468c646247aa..584da2f539e5 100644 --- a/Documentation/devicetree/bindings/riscv/extensions.yaml +++ b/Documentation/devicetree/bindings/riscv/extensions.yaml @@ -177,6 +177,18 @@ properties: is supported as ratified at commit 5059e0ca641c ("update to ratified") of the riscv-zacas. + - const: zawrs + description: | + The Zawrs extension for entering a low-power state or for trapping + to a hypervisor while waiting on a store to a memory location, as + ratified in commit 98918c844281 ("Merge pull request #1217 from + riscv/zawrs") of riscv-isa-manual. Linux assumes that WRS.NTO will + either always eventually terminate the stall due to the reservation + set becoming invalid, implementation-specific other reasons, or + because a higher privilege level has configured it to cause an + illegal instruction exception after an implementation-specific + bounded time limit. + - const: zba description: | The standard Zba bit-manipulation extension for address generation
Add description for the Zawrs (Wait-on-Reservation-Set) ISA extension which was ratified in commit 98918c844281 of riscv-isa-manual. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> --- .../devicetree/bindings/riscv/extensions.yaml | 12 ++++++++++++ 1 file changed, 12 insertions(+)