Message ID | 1440730099-29133-4-git-send-email-boqun.feng@gmail.com (mailing list archive) |
---|---|
State | Superseded, archived |
On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> +/*
> + * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
> + * a "bne-" instruction at the end, an isync is enough as an acquire
> + * barrier on platforms without lwsync.
> + */
> +#ifdef CONFIG_SMP
> +#define smp_acquire_barrier__after_atomic() \
> +	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
> +#else
> +#define smp_acquire_barrier__after_atomic() barrier()
> +#endif
> +#define arch_atomic_op_acquire(op, args...) \
> +({ \
> +	typeof(op##_relaxed(args)) __ret = op##_relaxed(args); \
> +	smp_acquire_barrier__after_atomic(); \
> +	__ret; \
> +})
> +
> +#define arch_atomic_op_release(op, args...) \
> +({ \
> +	smp_lwsync(); \
> +	op##_relaxed(args); \
> +})

Urgh, so this is RCpc. We were trying to get rid of that if possible.
Let's wait until that's settled before introducing more of it.

lkml.kernel.org/r/20150820155604.GB24100@arm.com
Hi Peter,

On Fri, Aug 28, 2015 at 12:48:54PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> > +/*
> > + * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
> > + * a "bne-" instruction at the end, an isync is enough as an acquire
> > + * barrier on platforms without lwsync.
> > + */
> > [...]
>
> Urgh, so this is RCpc. We were trying to get rid of that if possible.
> Let's wait until that's settled before introducing more of it.
>
> lkml.kernel.org/r/20150820155604.GB24100@arm.com

OK, got it. Thanks.

So I'm not going to introduce these arch-specific macros; I think what I
need to implement are just the _relaxed variants and cmpxchg_acquire.

Regards,
Boqun
On Fri, Aug 28, 2015 at 08:06:14PM +0800, Boqun Feng wrote:
> Hi Peter,
>
> On Fri, Aug 28, 2015 at 12:48:54PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 28, 2015 at 10:48:17AM +0800, Boqun Feng wrote:
> > > [...]
> >
> > Urgh, so this is RCpc. We were trying to get rid of that if possible.
> > Let's wait until that's settled before introducing more of it.
> >
> > lkml.kernel.org/r/20150820155604.GB24100@arm.com
>
> OK, got it. Thanks.
>
> So I'm not going to introduce these arch-specific macros; I think what I
> need to implement are just the _relaxed variants and cmpxchg_acquire.

Ah.. just read through the thread you mentioned; I might have
misunderstood you, probably because I didn't understand RCpc well.

You are saying that in a RELEASE we -might- switch from smp_lwsync() to
smp_mb() semantically, right? I guess this means we -might- switch from
RCpc to RCsc, right?

If so, I think I'd better wait until we have a conclusion for this.

Thank you for your comments!

Regards,
Boqun
On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:
> On Fri, Aug 28, 2015 at 08:06:14PM +0800, Boqun Feng wrote:
> > [...]
>
> Ah.. just read through the thread you mentioned; I might have
> misunderstood you, probably because I didn't understand RCpc well.
>
> You are saying that in a RELEASE we -might- switch from smp_lwsync() to
> smp_mb() semantically, right? I guess this means we -might- switch from
> RCpc to RCsc, right?
>
> If so, I think I'd better wait until we have a conclusion for this.

Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
not.

Currently PowerPC is the only arch that (can, and) does RCpc and gives a
weaker RELEASE + ACQUIRE. Only the CPU that did the ACQUIRE is
guaranteed to see the stores of the CPU that did the RELEASE in order.

As it stands, RCU is the only _known_ codebase where this matters, but
we did in fact write code for a fair number of years 'assuming' RELEASE
+ ACQUIRE was a full barrier, so who knows what else is out there.


RCsc - release consistency sequential consistency
RCpc - release consistency processor consistency

https://en.wikipedia.org/wiki/Processor_consistency (where they have
s/sequential/causal/)
On Fri, Aug 28, 2015 at 05:39:21PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:

<snip>

> > Ah.. just read through the thread you mentioned; I might have
> > misunderstood you, probably because I didn't understand RCpc well.
> >
> > You are saying that in a RELEASE we -might- switch from smp_lwsync() to
> > smp_mb() semantically, right? I guess this means we -might- switch from
> > RCpc to RCsc, right?
> >
> > If so, I think I'd better wait until we have a conclusion for this.
>
> Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> not.
>
> Currently PowerPC is the only arch that (can, and) does RCpc and gives a
> weaker RELEASE + ACQUIRE. Only the CPU that did the ACQUIRE is
> guaranteed to see the stores of the CPU that did the RELEASE in order.
>
> As it stands, RCU is the only _known_ codebase where this matters, but
> we did in fact write code for a fair number of years 'assuming' RELEASE
> + ACQUIRE was a full barrier, so who knows what else is out there.
>
>
> RCsc - release consistency sequential consistency
> RCpc - release consistency processor consistency
>
> https://en.wikipedia.org/wiki/Processor_consistency (where they have
> s/sequential/causal/)

Thank you for your detailed explanation! Much clearer now ;-)

Regards,
Boqun
On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> On Fri, Aug 28, 2015 at 10:16:02PM +0800, Boqun Feng wrote:
> > [...]
>
> Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> not.

We've discussed this before, but for the sake of completeness, I don't
think we're fully RCsc either, because we don't order the actual RELEASE
operation against a subsequent ACQUIRE operation:

P0
smp_store_release(&x, 1);
foo = smp_load_acquire(&y);

P1
smp_store_release(&y, 1);
bar = smp_load_acquire(&x);

We allow foo == bar == 0, which is prohibited by SC.

However, we *do* enforce ordering on any prior or subsequent accesses
for the code snippet above (the release and acquire combine to give a
full barrier), which makes these primitives well suited to things like
message passing.

Will
On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > not.
>
> We've discussed this before, but for the sake of completeness, I don't
> think we're fully RCsc either, because we don't order the actual RELEASE
> operation against a subsequent ACQUIRE operation:
>
> P0
> smp_store_release(&x, 1);
> foo = smp_load_acquire(&y);
>
> P1
> smp_store_release(&y, 1);
> bar = smp_load_acquire(&x);
>
> We allow foo == bar == 0, which is prohibited by SC.

I certainly hope that no one expects foo == bar == 0 to be prohibited!!!

On the other hand, in this case, foo == bar == 1 will be prohibited:

P0
foo = smp_load_acquire(&y);
smp_store_release(&x, 1);

P1
bar = smp_load_acquire(&x);
smp_store_release(&y, 1);

> However, we *do* enforce ordering on any prior or subsequent accesses
> for the code snippet above (the release and acquire combine to give a
> full barrier), which makes these primitives well suited to things like
> message passing.

If I understand your example correctly, neither x86 nor Power implement
a full barrier in this case. For example:

P0
WRITE_ONCE(a, 1);
smp_store_release(b, 1);
r1 = smp_load_acquire(c);
r2 = READ_ONCE(d);

P1
WRITE_ONCE(d, 1);
smp_mb();
r3 = READ_ONCE(a);

Both x86 and Power can reorder P0 as follows:

P0
r1 = smp_load_acquire(c);
r2 = READ_ONCE(d);
WRITE_ONCE(a, 1);
smp_store_release(b, 1);

Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is
allowed.

Or am I missing your point here?

							Thanx, Paul
Hi Paul,

On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > On Fri, Aug 28, 2015 at 04:39:21PM +0100, Peter Zijlstra wrote:
> > > Yes, the difference between RCpc and RCsc is in the meaning of RELEASE +
> > > ACQUIRE. With RCsc that implies a full memory barrier, with RCpc it does
> > > not.
> >
> > We've discussed this before, but for the sake of completeness, I don't
> > think we're fully RCsc either, because we don't order the actual RELEASE
> > operation against a subsequent ACQUIRE operation:
> >
> > P0
> > smp_store_release(&x, 1);
> > foo = smp_load_acquire(&y);
> >
> > P1
> > smp_store_release(&y, 1);
> > bar = smp_load_acquire(&x);
> >
> > We allow foo == bar == 0, which is prohibited by SC.
>
> I certainly hope that no one expects foo == bar == 0 to be prohibited!!!

I just thought it was worth making this point, because it is prohibited
in SC and I don't want people to think that our RELEASE/ACQUIRE operations
are SC (even though they happen to be on arm64).

> On the other hand, in this case, foo == bar == 1 will be prohibited:
>
> P0
> foo = smp_load_acquire(&y);
> smp_store_release(&x, 1);
>
> P1
> bar = smp_load_acquire(&x);
> smp_store_release(&y, 1);

Agreed.

> > However, we *do* enforce ordering on any prior or subsequent accesses
> > for the code snippet above (the release and acquire combine to give a
> > full barrier), which makes these primitives well suited to things like
> > message passing.
>
> If I understand your example correctly, neither x86 nor Power implement
> a full barrier in this case. For example:
>
> P0
> WRITE_ONCE(a, 1);
> smp_store_release(b, 1);
> r1 = smp_load_acquire(c);
> r2 = READ_ONCE(d);
>
> P1
> WRITE_ONCE(d, 1);
> smp_mb();
> r3 = READ_ONCE(a);
>
> Both x86 and Power can reorder P0 as follows:
>
> P0
> r1 = smp_load_acquire(c);
> r2 = READ_ONCE(d);
> WRITE_ONCE(a, 1);
> smp_store_release(b, 1);
>
> Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is
> allowed.
>
> Or am I missing your point here?

I think this example is slightly different. Having the RELEASE/ACQUIRE
operations being reordered with respect to each other is one thing, but
I thought we were heading in a direction where they combined to give a
full barrier with respect to other accesses. In that case, the reordering
above would be forbidden.

Peter -- if the above reordering can happen on x86, then moving away
from RCpc is going to be less popular than I hoped...

Will
On Wed, Sep 02, 2015 at 10:59:06AM +0100, Will Deacon wrote:
> Hi Paul,
>
> On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> > On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > > [...]
> > >
> > > We allow foo == bar == 0, which is prohibited by SC.
> >
> > I certainly hope that no one expects foo == bar == 0 to be prohibited!!!
>
> I just thought it was worth making this point, because it is prohibited
> in SC and I don't want people to think that our RELEASE/ACQUIRE operations
> are SC (even though they happen to be on arm64).

OK, good.

> > On the other hand, in this case, foo == bar == 1 will be prohibited:
> >
> > [...]
>
> Agreed.

Good as well.

> > > However, we *do* enforce ordering on any prior or subsequent accesses
> > > for the code snippet above (the release and acquire combine to give a
> > > full barrier), which makes these primitives well suited to things like
> > > message passing.
> >
> > If I understand your example correctly, neither x86 nor Power implement
> > a full barrier in this case. For example:
> >
> > [...]
> >
> > Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is
> > allowed.
> >
> > Or am I missing your point here?
>
> I think this example is slightly different. Having the RELEASE/ACQUIRE
> operations being reordered with respect to each other is one thing, but
> I thought we were heading in a direction where they combined to give a
> full barrier with respect to other accesses. In that case, the reordering
> above would be forbidden.

It is certainly less added overhead to make unlock-lock a full barrier
than it is to make smp_store_release()-smp_load_acquire() a full
barrier. I am not fully convinced on either, aside from needing some way
to make unlock-lock a full barrier within the RCU implementation, for
which the now-privatized smp_mb__after_unlock_lock() suffices.

> Peter -- if the above reordering can happen on x86, then moving away
> from RCpc is going to be less popular than I hoped...

;-)

							Thanx, Paul
[left the context in the hope that we can make some progress]

On Wed, Sep 02, 2015 at 10:59:06AM +0100, Will Deacon wrote:
> On Tue, Sep 01, 2015 at 10:45:40PM +0100, Paul E. McKenney wrote:
> > On Tue, Sep 01, 2015 at 08:00:27PM +0100, Will Deacon wrote:
> > > We've discussed this before, but for the sake of completeness, I don't
> > > think we're fully RCsc either, because we don't order the actual RELEASE
> > > operation against a subsequent ACQUIRE operation:
> > >
> > > P0
> > > smp_store_release(&x, 1);
> > > foo = smp_load_acquire(&y);
> > >
> > > P1
> > > smp_store_release(&y, 1);
> > > bar = smp_load_acquire(&x);
> > >
> > > We allow foo == bar == 0, which is prohibited by SC.
> >
> > [...]
> >
> > If I understand your example correctly, neither x86 nor Power implement
> > a full barrier in this case. For example:
> >
> > P0
> > WRITE_ONCE(a, 1);
> > smp_store_release(b, 1);
> > r1 = smp_load_acquire(c);
> > r2 = READ_ONCE(d);
> >
> > P1
> > WRITE_ONCE(d, 1);
> > smp_mb();
> > r3 = READ_ONCE(a);
> >
> > Both x86 and Power can reorder P0 as follows:
> >
> > P0
> > r1 = smp_load_acquire(c);
> > r2 = READ_ONCE(d);
> > WRITE_ONCE(a, 1);
> > smp_store_release(b, 1);
> >
> > Which clearly shows that the non-SC outcome r2 == 0 && r3 == 0 is
> > allowed.
>
> I think this example is slightly different. Having the RELEASE/ACQUIRE
> operations being reordered with respect to each other is one thing, but
> I thought we were heading in a direction where they combined to give a
> full barrier with respect to other accesses. In that case, the reordering
> above would be forbidden.
>
> Peter -- if the above reordering can happen on x86, then moving away
> from RCpc is going to be less popular than I hoped...

Peter, any thoughts? I'm not au fait with the x86 memory model, but what
Paul's saying is worrying.

Will
On Fri, Sep 11, 2015 at 01:45:07PM +0100, Will Deacon wrote:
> [left the context in the hope that we can make some progress]
>
> On Wed, Sep 02, 2015 at 10:59:06AM +0100, Will Deacon wrote:
> > [...]
> >
> > I think this example is slightly different. Having the RELEASE/ACQUIRE
> > operations being reordered with respect to each other is one thing, but
> > I thought we were heading in a direction where they combined to give a
> > full barrier with respect to other accesses. In that case, the reordering
> > above would be forbidden.
> >
> > Peter -- if the above reordering can happen on x86, then moving away
> > from RCpc is going to be less popular than I hoped...
>
> Peter, any thoughts? I'm not au fait with the x86 memory model, but what
> Paul's saying is worrying.

The herd tool has an x86 mode, which will allow you to double-check my
scenario. This tool is described in "Herding Cats: Modelling,
Simulation, Testing, and Data-mining for Weak Memory" by Alglave,
Maranget, and Tautschnig. The herd tool is available at this git
repository: https://github.com/herd/herdtools.

							Thanx, Paul
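For reference, Paul's scenario can be written as an x86 litmus test for herd roughly as follows. This is a sketch: on x86, smp_store_release() and smp_load_acquire() compile down to plain MOVs and smp_mb() to MFENCE; the test name and register choices are mine, and the exact syntax may differ between herdtools versions.

```
X86 RCpc-ReleaseAcquire
"r2 == 0 && r3 == 0 (0:EBX=0 /\ 1:EAX=0) is allowed under TSO"
{ a=0; b=0; c=0; d=0; }
 P0           | P1           ;
 MOV [a],$1   | MOV [d],$1   ;
 MOV [b],$1   | MFENCE       ;
 MOV EAX,[c]  | MOV EAX,[a]  ;
 MOV EBX,[d]  |              ;
exists (0:EBX=0 /\ 1:EAX=0)
```

Here 0:EBX plays the role of r2 and 1:EAX the role of r3; since P0's loads may be satisfied while its stores still sit in the store buffer, herd should report the `exists` clause as reachable.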
Sorry for being tardy, I had a wee spell of feeling horrible and then I
procrastinated longer than I should have.

On Fri, Sep 11, 2015 at 01:45:07PM +0100, Will Deacon wrote:
> Peter, any thoughts? I'm not au fait with the x86 memory model, but what
> Paul's saying is worrying.

Right, so Paul is right -- and I completely forgot (I used to know about
that).

So all the TSO archs (SPARC-TSO, x86 (!OOSTORE) and s390) can do
smp_load_acquire()/smp_store_release() with just barrier(), and while:

	smp_store_release(&x);
	smp_load_acquire(&x);

will provide full order by means of the address dependency,

	smp_store_release(&x);
	smp_load_acquire(&y);

will not. Because the one reorder TSO allows is exactly that one.

> Peter -- if the above reordering can happen on x86, then moving away
> from RCpc is going to be less popular than I hoped...

Sadly yes.. We could of course try and split LOCK from ACQUIRE again,
but I'm not sure that's going to help anything except confusion.
On Mon, Sep 14, 2015 at 01:35:20PM +0200, Peter Zijlstra wrote:
> Sorry for being tardy, I had a wee spell of feeling horrible and then I
> procrastinated longer than I should have.
>
> On Fri, Sep 11, 2015 at 01:45:07PM +0100, Will Deacon wrote:
> > Peter, any thoughts? I'm not au fait with the x86 memory model, but what
> > Paul's saying is worrying.
>
> Right, so Paul is right -- and I completely forgot (I used to know about
> that).
>
> So all the TSO archs (SPARC-TSO, x86 (!OOSTORE) and s390) can do
> smp_load_acquire()/smp_store_release() with just barrier(), and while:
>
>	smp_store_release(&x);
>	smp_load_acquire(&x);
>
> will provide full order by means of the address dependency,
>
>	smp_store_release(&x);
>	smp_load_acquire(&y);
>
> will not. Because the one reorder TSO allows is exactly that one.
>
> > Peter -- if the above reordering can happen on x86, then moving away
> > from RCpc is going to be less popular than I hoped...
>
> Sadly yes.. We could of course try and split LOCK from ACQUIRE again,
> but I'm not sure that's going to help anything except confusion.

This of course also means we need something like:

	smp_mb__release_acquire()

which cannot be a no-op for TSO archs.

And it might even mean it needs to be the same as smp_mb__unlock_lock(),
but I need to think more on this. The scenario is:

	CPU0			CPU1

	unlock(x)
	  smp_store_release(&x->lock, 0);

	unlock(y)
	  smp_store_release(&next->lock, 1); /* next == &y */

				lock(y)
				  while (!(smp_load_acquire(&y->lock))
					cpu_relax();

Where the lock does _NOT_ issue a store to acquire the lock at all. Now
I don't think any of our current primitives manage this, so we should be
good, but it might just be possible.

And at the same time, having both:

	smp_mb__release_acquire()
	smp_mb__unlock_lock()

is quite horrible, for it clearly shows a LOCK isn't quite the same as
ACQUIRE :/
On Mon, Sep 14, 2015 at 02:01:53PM +0200, Peter Zijlstra wrote:
> The scenario is:
>
>	CPU0			CPU1
>
>	unlock(x)
>	  smp_store_release(&x->lock, 0);
>
>	unlock(y)
>	  smp_store_release(&next->lock, 1); /* next == &y */
>
>				lock(y)
>				  while (!(smp_load_acquire(&y->lock))
>					cpu_relax();
>
> Where the lock does _NOT_ issue a store to acquire the lock at all. Now
> I don't think any of our current primitives manage this, so we should be
> good, but it might just be possible.

So with a bit more thought this seems fundamentally impossible; you
always need some stores in a lock() implementation -- the above, for
instance, needs to queue itself, otherwise CPU0 will not be able to find
it, etc.
On Mon, Sep 14, 2015 at 01:11:56PM +0100, Peter Zijlstra wrote:
> On Mon, Sep 14, 2015 at 02:01:53PM +0200, Peter Zijlstra wrote:
> > The scenario is:
> >
> > [...]
> >
> > Where the lock does _NOT_ issue a store to acquire the lock at all. Now
> > I don't think any of our current primitives manage this, so we should be
> > good, but it might just be possible.
>
> So with a bit more thought this seems fundamentally impossible; you
> always need some stores in a lock() implementation -- the above, for
> instance, needs to queue itself, otherwise CPU0 will not be able to find
> it, etc.

Which brings us back round to separating LOCK/UNLOCK from
ACQUIRE/RELEASE.

If we say that UNLOCK(foo) -> LOCK(bar) is ordered but RELEASE(baz) ->
ACQUIRE(boz) is only ordered by smp_mb__release_acquire(), then I think
we're in a position where we can at least build arbitrary locks portably
out of ACQUIRE/RELEASE operations, even though I don't see any users of
that macro in the imminent future.

I'll have a crack at some documentation.

Will
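A hypothetical sketch of the smp_mb__release_acquire() barrier under discussion, in C11 terms. The name comes from the thread; the implementation choice (a full barrier, as would be required on TSO archs where release/acquire are mere compiler barriers) and the demo function are my assumptions:

```c
#include <stdatomic.h>

#define smp_mb() atomic_thread_fence(memory_order_seq_cst)

/*
 * Hypothetical: must be a full barrier on TSO archs, where the
 * store->load reorder between a RELEASE and a subsequent ACQUIRE to a
 * different variable is the one reorder the hardware permits; a no-op
 * would suffice on archs whose release/acquire already forbid it.
 */
#define smp_mb__release_acquire() smp_mb()

static _Atomic int x, y;

/*
 * Usage as proposed in the thread: placed between the RELEASE and the
 * ACQUIRE to upgrade the pair to a full barrier with respect to other
 * accesses.
 */
int release_acquire_demo(void)
{
	atomic_store_explicit(&x, 1, memory_order_release);
	smp_mb__release_acquire();
	return atomic_load_explicit(&y, memory_order_acquire);
}
```

With the barrier in place, the store to x can no longer pass the load from y, closing exactly the hole Paul's example demonstrated.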
On Mon, Sep 14, 2015 at 04:38:48PM +0100, Will Deacon wrote:
> On Mon, Sep 14, 2015 at 01:11:56PM +0100, Peter Zijlstra wrote:
> > [...]
>
> Which brings us back round to separating LOCK/UNLOCK from
> ACQUIRE/RELEASE.

I believe that we do need to do this, unless we decide to have
unlock-lock continue to imply only acquire and release, rather than full
ordering. I believe that Mike Ellerman is working up additional
benchmarking on this.

							Thanx, Paul

> If we say that UNLOCK(foo) -> LOCK(bar) is ordered but RELEASE(baz) ->
> ACQUIRE(boz) is only ordered by smp_mb__release_acquire(), then I think
> we're in a position where we can at least build arbitrary locks portably
> out of ACQUIRE/RELEASE operations, even though I don't see any users of
> that macro in the imminent future.
>
> I'll have a crack at some documentation.
>
> Will
diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 55f106e..806ce50 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,39 @@
 
 #define ATOMIC_INIT(i)		{ (i) }
 
+/*
+ * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, an isync is enough as an acquire
+ * barrier on platforms without lwsync.
+ */
+#ifdef CONFIG_SMP
+#define smp_acquire_barrier__after_atomic() \
+	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
+#else
+#define smp_acquire_barrier__after_atomic() barrier()
+#endif
+#define arch_atomic_op_acquire(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret = op##_relaxed(args);		\
+	smp_acquire_barrier__after_atomic();				\
+	__ret;								\
+})
+
+#define arch_atomic_op_release(op, args...)				\
+({									\
+	smp_lwsync();							\
+	op##_relaxed(args);						\
+})
+
+#define arch_atomic_op_fence(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret;				\
+	smp_lwsync();							\
+	__ret = op##_relaxed(args);					\
+	smp_mb__after_atomic();						\
+	__ret;								\
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
 	int t;
@@ -42,27 +75,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)	\
 	: "cc");							\
 }									\
 
-#define ATOMIC_OP_RETURN(op, asm_op)					\
-static __inline__ int atomic_##op##_return(int a, atomic_t *v)		\
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)				\
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v)	\
 {									\
 	int t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	lwarx	%0,0,%2		# atomic_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-	PPC405_ERR77(0,%2)						\
-"	stwcx.	%0,0,%2 \n"						\
+"1:	lwarx	%0,0,%3		# atomic_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+	PPC405_ERR77(0, %3)						\
+"	stwcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, asm_op)
+#define ATOMIC_OPS(op, asm_op)						\
+	ATOMIC_OP(op, asm_op)						\
+	ATOMIC_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC_OPS(add, add)
 ATOMIC_OPS(sub, subf)
@@ -71,8 +104,11 @@ ATOMIC_OP(and, and)
 ATOMIC_OP(or, or)
 ATOMIC_OP(xor, xor)
 
+#define atomic_add_return_relaxed atomic_add_return_relaxed
+#define atomic_sub_return_relaxed atomic_sub_return_relaxed
+
 #undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
@@ -285,26 +321,27 @@ static __inline__ void atomic64_##op(long a, atomic64_t *v)	\
 	: "cc");							\
 }
 
-#define ATOMIC64_OP_RETURN(op, asm_op)					\
-static __inline__ long atomic64_##op##_return(long a, atomic64_t *v)	\
+#define ATOMIC64_OP_RETURN_RELAXED(op, asm_op)				\
+static inline long							\
+atomic64_##op##_return_relaxed(long a, atomic64_t *v)			\
 {									\
 	long t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	ldarx	%0,0,%2		# atomic64_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-"	stdcx.	%0,0,%2 \n"						\
+"1:	ldarx	%0,0,%3		# atomic64_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+"	stdcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC64_OPS(op, asm_op) ATOMIC64_OP(op, asm_op) ATOMIC64_OP_RETURN(op, asm_op)
+#define ATOMIC64_OPS(op, asm_op)					\
+	ATOMIC64_OP(op, asm_op)						\
+	ATOMIC64_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC64_OPS(add, add)
 ATOMIC64_OPS(sub, subf)
@@ -312,8 +349,11 @@ ATOMIC64_OP(and, and)
 ATOMIC64_OP(or, or)
 ATOMIC64_OP(xor, xor)
 
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
+#define atomic64_add_return_relaxed atomic64_add_return_relaxed
+#define atomic64_sub_return_relaxed atomic64_sub_return_relaxed
+
+#undef ATOMIC64_OPS
+#undef ATOMIC64_OP_RETURN_RELAXED
 #undef ATOMIC64_OP
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)
On powerpc, we don't need a general memory barrier to achieve acquire
and release semantics, so arch_atomic_op_{acquire,release} can be
implemented using "lwsync" and "isync".

For release semantics, since we only need to ensure that all memory
accesses issued before take effect before the -store- part of the
atomic, "lwsync" is all we need. On platforms without "lwsync", "sync"
should be used. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is all we need for the same reason.
However, on platforms without "lwsync", we can use "isync" rather than
"sync" as an acquire barrier. So a new kind of barrier,
smp_acquire_barrier__after_atomic(), is introduced, which is barrier()
on UP, "lwsync" if available and "isync" otherwise.

For fully ordered semantics, like the original ones, smp_lwsync() is
put before the relaxed variants and smp_mb__after_atomic() is put after.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 88 ++++++++++++++++++++++++++-----------
 1 file changed, 64 insertions(+), 24 deletions(-)