
[RFC,v2,3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants

Message ID 1442418575-12297-4-git-send-email-boqun.feng@gmail.com (mailing list archive)
State Superseded, archived

Commit Message

Boqun Feng Sept. 16, 2015, 3:49 p.m. UTC
On powerpc, we don't need a general memory barrier to achieve acquire and
release semantics, so __atomic_op_{acquire,release} can be implemented
using "lwsync" and "isync".

For release semantics, we only need to ensure that all memory accesses
issued before the atomic take effect before its -store- part, so
"lwsync" is all we need. On platforms without "lwsync", "sync" should be
used instead. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is all we need, for a similar reason.
However, on platforms without "lwsync", we can use "isync" rather than
"sync" as an acquire barrier. So a new kind of barrier,
smp_acquire_barrier__after_atomic(), is introduced, which is barrier() on
UP, "lwsync" if available and "isync" otherwise.

__atomic_op_fence is defined as smp_lwsync() + _relaxed +
smp_mb__after_atomic() to guarantee full ordering.

Implement atomic{,64}_{add,sub}_return_relaxed, and build other variants
with these helpers.
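
As an illustration (a sketch of the generic pattern, not the exact patch
text), once atomic_add_return_relaxed() is defined, the remaining
variants fall out of the wrappers above:

	/* Built by the generic atomic layer once _relaxed is defined: */
	#define atomic_add_return_acquire(...)	\
		__atomic_op_acquire(atomic_add_return, __VA_ARGS__)
	#define atomic_add_return_release(...)	\
		__atomic_op_release(atomic_add_return, __VA_ARGS__)
	#define atomic_add_return(...)		\
		__atomic_op_fence(atomic_add_return, __VA_ARGS__)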

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 88 ++++++++++++++++++++++++++++-----------
 1 file changed, 64 insertions(+), 24 deletions(-)

Comments

Will Deacon Sept. 18, 2015, 4:59 p.m. UTC | #1
On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> On powerpc, we don't need a general memory barrier to achieve acquire and
> release semantics, so __atomic_op_{acquire,release} can be implemented
> using "lwsync" and "isync".

I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
of the s390 thread you linked me to before we start spreading this
further:

  https://lkml.org/lkml/2015/9/15/836

Will
Boqun Feng Sept. 19, 2015, 3:33 p.m. UTC | #2
Hi Will,

On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > On powerpc, we don't need a general memory barrier to achieve acquire and
> > release semantics, so __atomic_op_{acquire,release} can be implemented
> > using "lwsync" and "isync".
> 
> I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom

Actually the transitivity is still guaranteed here, I think ;-)

(Before I put my reasoning, I have to admit I just learned about
cumulativity recently, so my reasoning may be wrong. But the good thing
is that we have our POWER experts on CC. In case I'm wrong, they
can correct me.)

The thing is, on POWER, transitivity is implemented by a similar but
slightly different concept, cumulativity, and as said in the link:

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html

"""
The ordering done by a memory barrier is said to be “cumulative” if it
also orders storage accesses that are performed by processors and
mechanisms other than P1, as follows.

*	A includes all applicable storage accesses by any such processor
	or mechanism that have been performed with respect to P1 before
	the memory barrier is created.

*	B includes all applicable storage accesses by any such processor
	or mechanism that are performed after a Load instruction
	executed by that processor or mechanism has returned the value
	stored by a store that is in B.
"""

Please note that the set B can be extended indefinitely without any
other cumulative barrier.

So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
in the RELEASE operation is cumulative, transitivity is guaranteed.
And lwsync is cumulative, so we are fine here.


I also wrote a herd litmus to test this. Due to the tool's limitation, I
use the xchg_release and xchg_acquire to test. And since herd doesn't
support backward branching, some tricks are used here to work around:


PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0 ; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0 ; 2:r11=0; 2:r12=a;
}
 P0           | P1                  | P2                  ;
 stw r1,0(r2) | lwz r1,0(r2)        |                     ;
              | lwsync              | lwarx r11, r10, r12 ;
              | lwarx  r11,r10,r12  | stwcx. r3, r10, r12 ;
              | stwcx. r3,r10,r12   | bne Fail            ;
              |                     | isync               ;
              |                     | lwz r1, 0(r2)       ;
              |                     | Fail:               ;

exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1 = 0)
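
(For reference, a litmus file like the above can be fed directly to the
herd tool from herdtools, e.g. "herd7 lwsync+isync-transitivity.litmus",
assuming herdtools7 is installed.)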


And the result of this litmus is:

Test lwsync+isync-transitivity Allowed
States 15
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=9; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=9; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=0; 2:r1=9; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=9; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
No
Witnesses
Positive: 0 Negative: 29
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 29

, which means transitivity is guaranteed.

Regards,
Boqun

> of the s390 thread you linked me to before we start spreading this
> further:
> 
>   https://lkml.org/lkml/2015/9/15/836
> 
> Will
Boqun Feng Sept. 20, 2015, 8:23 a.m. UTC | #3
On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> Hi Will,
> 
> On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > using "lwsync" and "isync".
> > 
> > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> 
> Actually the transitivity is still guaranteed here, I think ;-)
> 
> (Before I put my reasoning, I have to admit I just learned about
> cumulativity recently, so my reasoning may be wrong. But the good thing
> is that we have our POWER experts on CC. In case I'm wrong, they
> can correct me.)
> 
> The thing is, on POWER, transitivity is implemented by a similar but
> slightly different concept, cumulativity, and as said in the link:
> 
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> 
> """
> The ordering done by a memory barrier is said to be “cumulative” if it
> also orders storage accesses that are performed by processors and
> mechanisms other than P1, as follows.
> 
> *	A includes all applicable storage accesses by any such processor
> 	or mechanism that have been performed with respect to P1 before
> 	the memory barrier is created.
> 
> *	B includes all applicable storage accesses by any such processor
> 	or mechanism that are performed after a Load instruction
> 	executed by that processor or mechanism has returned the value
> 	stored by a store that is in B.
> """
> 
> Please note that the set B can be extended indefinitely without any
> other cumulative barrier.
> 
> So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
> in the RELEASE operation is cumulative, transitivity is guaranteed.
> And lwsync is cumulative, so we are fine here.
> 
> 
> I also wrote a herd litmus to test this. Due to the tool's limitation, I
> use the xchg_release and xchg_acquire to test. And since herd doesn't

Hmm.. I think I meant to say atomic_xchg_release and
atomic_xchg_acquire here; sorry about the inaccuracy.

> support backward branching, some tricks are used here to work around:
> 

And I checked again: herd does support backward branching. The problem
is just that with backward branching there will be a lot more states the
tool needs to check, but it seems there are not too many in this case, so
I modified the litmus a little bit as follows:

PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0 ; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0 ; 2:r11=0; 2:r12=a;
}
 P0           | P1                  | P2                  ;
 stw r1,0(r2) | lwz r1,0(r2)        | Fail2:              ;
              | lwsync              | lwarx r11, r10, r12 ;
              | Fail1:              | stwcx. r3, r10, r12 ;
              | lwarx  r11,r10,r12  | bne Fail2           ;
              | stwcx. r3,r10,r12   | isync               ;
              | bne Fail1           | lwz r1, 0(r2)       ; 

exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1 = 0)

which is actually:

CPU 0			CPU 1				CPU 2
==============		=====================		=======================
{int x = 0, atomic_t a = ATOMIC_INIT(0)}
WRITE_ONCE(x,1);	t1 = READ_ONCE(x);		t2 = atomic_xchg_acquire(&a, 2);
			atomic_xchg_release(&a, 1);	t3 = READ_ONCE(x);

exists
(t1 == 1 && t2 == 1 && t3 == 0)


The result is still (it may take a while to get the result):

Test lwsync+isync-transitivity Allowed
States 11
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
Loop No
Witnesses
Positive: 0 Negative: 198
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 198

, which means transitivity is guaranteed.


And I think it deserves more analysis based on cumulativity:

Initially, for the lwsync on P1 (CPU 1), we have sets A and B of the
storage accesses on the same processor which lwsync orders:

A includes:
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12

and as t1 == 1, meaning that before the lwsync P1 perceives the STORE of
x on CPU 0, another storage access gets included in A:

A now includes:
	on CPU 0:
		stw r1, 0(r2) // WRITE_ONCE(x,1);
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B now includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12

and as t2 == 1, meaning that on CPU 2 the "lwarx  r11,r10,r12" in
atomic_xchg_acquire() reads the value stored by "stwcx. r3,r10,r12" in
atomic_xchg_release() on CPU 1, all storage accesses
performed after atomic_xchg_acquire() get included in set B:

A now includes:
	on CPU 0:
		stw r1, 0(r2) // WRITE_ONCE(x,1);
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B now includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12
	on CPU 2:
		lwz r1, 0(r2) // t3 = READ_ONCE(x);

Therefore the STORE of x on CPU 0 and the LOAD of x on CPU 2 cannot be
reordered in this case, which means transitivity is guaranteed.
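
In kernel-C terms (a sketch only, reusing t1/t2/t3 from the pseudo-code
above), the forbidden outcome can be written as an assertion:

	/* Never triggers: lwsync's cumulativity, plus the ACQUIRE reading
	 * from the RELEASE, orders the two accesses to x across CPUs. */
	BUG_ON(t1 == 1 && t2 == 1 && t3 == 0);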

Regards,
Boqun
Will Deacon Sept. 21, 2015, 10:24 p.m. UTC | #4
Hi Boqun,

On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > using "lwsync" and "isync".
> > > 
> > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > 
> > Actually the transitivity is still guaranteed here, I think ;-)

The litmus test I'm thinking of is:


{
0:r2=x;
1:r2=x; 1:r5=z;
2:r2=z; 2:r4=x;
}
 P0           | P1            | P2           ;
 li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
 stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
              | beq LC00      | beq  LC01    ;
              | LC00:         | LC01:        ;
              | isync         | isync        ;
              | li r4,1       | lwz r3,0(r4) ;
              | stw r4,0(r5)  |              ;
exists
(1:r1=1 /\ 2:r1=1 /\ 2:r3=0)


Which appears to be allowed. I don't think you need to worry about backwards
branches for the ctrl+isync construction (none of the current examples do,
afaict).

Anyway, all the problematic cases seem to arise when we start mixing
ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
one group reads from an access in the other group). It would be simplest
to say that this doesn't provide any transitivity guarantees, and that
an ACQUIRE must always read from a RELEASE if transitivity is required.
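
For illustration, the safe pattern in kernel terms would look something
like the sketch below (variable names are mine, not from the thread; x
and y start at zero); the key is that the ACQUIRE on CPU 2 reads the
value stored by the RELEASE on CPU 1:

	/* CPU 0 */
	WRITE_ONCE(x, 1);

	/* CPU 1 */
	r1 = READ_ONCE(x);
	smp_store_release(&y, 1);	/* RELEASE */

	/* CPU 2 */
	r2 = smp_load_acquire(&y);	/* ACQUIRE */
	r3 = READ_ONCE(x);

	/* If r1 == 1 and r2 == 1, then r3 == 1 is guaranteed. */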

Will
Boqun Feng Sept. 21, 2015, 11:26 p.m. UTC | #5
On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> Hi Boqun,
> 
> On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > using "lwsync" and "isync".
> > > > 
> > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > 
> > > Actually the transitivity is still guaranteed here, I think ;-)
> 
> The litmus test I'm thinking of is:
> 
> 
> {
> 0:r2=x;
> 1:r2=x; 1:r5=z;
> 2:r2=z; 2:r4=x;
> }
>  P0           | P1            | P2           ;
>  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
>  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
>               | beq LC00      | beq  LC01    ;
>               | LC00:         | LC01:        ;
>               | isync         | isync        ;
>               | li r4,1       | lwz r3,0(r4) ;
>               | stw r4,0(r5)  |              ;
> exists
> (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> 
> 
> Which appears to be allowed. I don't think you need to worry about backwards
> branches for the ctrl+isync construction (none of the current examples do,
> afaict).
> 

Yes.. my concern about backwards branches is not quite related to the topic; I
cared about that mostly because my test is using atomic operations, and I
just want to test the exact asm code.

> Anyway, all the problematic cases seem to arise when we start mixing
> ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> one group reads from an access in the other group). It would be simplest
> to say that this doesn't provide any transitivity guarantees, and that
> an ACQUIRE must always read from a RELEASE if transitivity is required.
> 

Agreed. RELEASE alone doesn't provide transitivity and transitivity is
guaranteed only if an ACQUIRE reads from a RELEASE. That's exactly the
direction which the link (https://lkml.org/lkml/2015/9/15/836) is
heading to. So I think we are fine to use ctrl+isync here, right?

Regards,
Boqun
Boqun Feng Sept. 21, 2015, 11:37 p.m. UTC | #6
On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > Hi Boqun,
> > 
> > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > using "lwsync" and "isync".
> > > > > 
> > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > 
> > > > Actually the transitivity is still guaranteed here, I think ;-)
> > 
> > The litmus test I'm thinking of is:
> > 
> > 
> > {
> > 0:r2=x;
> > 1:r2=x; 1:r5=z;
> > 2:r2=z; 2:r4=x;
> > }
> >  P0           | P1            | P2           ;
> >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> >               | beq LC00      | beq  LC01    ;
> >               | LC00:         | LC01:        ;
> >               | isync         | isync        ;
> >               | li r4,1       | lwz r3,0(r4) ;
> >               | stw r4,0(r5)  |              ;
> > exists
> > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > 
> > 
> > Which appears to be allowed. I don't think you need to worry about backwards
> > branches for the ctrl+isync construction (none of the current examples do,
> > afaict).
> > 
> 
> Yes.. my concern about backwards branches is not quite related to the topic; I
> cared about that mostly because my test is using atomic operations, and I
> just want to test the exact asm code.
> 
> > Anyway, all the problematic cases seem to arise when we start mixing
> > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > one group reads from an access in the other group). It would be simplest
> > to say that this doesn't provide any transitivity guarantees, and that
> > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > 
> 
> Agreed. RELEASE alone doesn't provide transitivity and transitivity is
          ^^^^^^^
This should be ACQUIRE...

> guaranteed only if an ACQUIRE reads from a RELEASE. That's exactly the
> direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> heading to. So I think we are fine to use ctrl+isync here, right?
> 
> Regards,
> Boqun
Paul E. McKenney Sept. 22, 2015, 3:25 p.m. UTC | #7
On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > Hi Boqun,
> > > 
> > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > using "lwsync" and "isync".
> > > > > > 
> > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > 
> > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > 
> > > The litmus test I'm thinking of is:
> > > 
> > > 
> > > {
> > > 0:r2=x;
> > > 1:r2=x; 1:r5=z;
> > > 2:r2=z; 2:r4=x;
> > > }
> > >  P0           | P1            | P2           ;
> > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > >               | beq LC00      | beq  LC01    ;
> > >               | LC00:         | LC01:        ;
> > >               | isync         | isync        ;
> > >               | li r4,1       | lwz r3,0(r4) ;
> > >               | stw r4,0(r5)  |              ;
> > > exists
> > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > 
> > > 
> > > Which appears to be allowed. I don't think you need to worry about backwards
> > > branches for the ctrl+isync construction (none of the current examples do,
> > > afaict).
> > > 
> > 
> > Yes.. my concern about backwards branches is not quite related to the topic; I
> > cared about that mostly because my test is using atomic operations, and I
> > just want to test the exact asm code.
> > 
> > > Anyway, all the problematic cases seem to arise when we start mixing
> > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > one group reads from an access in the other group). It would be simplest
> > > to say that this doesn't provide any transitivity guarantees, and that
> > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > 
> > 
> > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
>           ^^^^^^^
> This should be ACQUIRE...
> 
> > guaranteed only if an ACQUIRE reads from a RELEASE. That's exactly the
> > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > heading to. So I think we are fine to use ctrl+isync here, right?

We are going to have to err on the side of strictness, that is, having
the documentation place more requirements on the developer than the union
of the hardware does.  Besides, I haven't heard any recent complaints
that memory-barriers.txt is too simple.  ;-)

							Thanx, Paul
Boqun Feng Sept. 23, 2015, 12:07 a.m. UTC | #8
On Tue, Sep 22, 2015 at 08:25:40AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> > On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > > Hi Boqun,
> > > > 
> > > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > > using "lwsync" and "isync".
> > > > > > > 
> > > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > > 
> > > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > > 
> > > > The litmus test I'm thinking of is:
> > > > 
> > > > 
> > > > {
> > > > 0:r2=x;
> > > > 1:r2=x; 1:r5=z;
> > > > 2:r2=z; 2:r4=x;
> > > > }
> > > >  P0           | P1            | P2           ;
> > > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > > >               | beq LC00      | beq  LC01    ;
> > > >               | LC00:         | LC01:        ;
> > > >               | isync         | isync        ;
> > > >               | li r4,1       | lwz r3,0(r4) ;
> > > >               | stw r4,0(r5)  |              ;
> > > > exists
> > > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > > 
> > > > 
> > > > Which appears to be allowed. I don't think you need to worry about backwards
> > > > branches for the ctrl+isync construction (none of the current examples do,
> > > > afaict).
> > > > 
> > > 
> > > Yes.. my concern about backwards branches is not quite related to the topic; I
> > > cared about that mostly because my test is using atomic operations, and I
> > > just want to test the exact asm code.
> > > 
> > > > Anyway, all the problematic cases seem to arise when we start mixing
> > > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > > one group reads from an access in the other group). It would be simplest
> > > > to say that this doesn't provide any transitivity guarantees, and that
> > > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > > 
> > > 
> > > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
> >           ^^^^^^^
> > This should be ACQUIRE...
> > 
> > > guaranteed only if an ACQUIRE reads from a RELEASE. That's exactly the
> > > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > > heading to. So I think we are fine to use ctrl+isync here, right?
> 
> We are going to have to err on the side of strictness, that is, having
> the documentation place more requirements on the developer than the union
> of the hardware does.  Besides, I haven't heard any recent complaints
> that memory-barriers.txt is too simple.  ;-)
> 

Agreed ;-)

For atomic operations, using isync in ACQUIRE operations does guarantee
that a pure RELEASE/ACQUIRE chain provides transitivity. So, again, I
think we are fine here to use isync in ACQUIRE atomic operations,
unless you think we need to be more strict, i.e., making ACQUIRE itself
provide transitivity?

Regards,
Boqun
Paul E. McKenney Sept. 25, 2015, 9:29 p.m. UTC | #9
On Wed, Sep 23, 2015 at 08:07:55AM +0800, Boqun Feng wrote:
> On Tue, Sep 22, 2015 at 08:25:40AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> > > On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > > > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > > > Hi Boqun,
> > > > > 
> > > > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > > > using "lwsync" and "isync".
> > > > > > > > 
> > > > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > > > 
> > > > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > > > 
> > > > > The litmus test I'm thinking of is:
> > > > > 
> > > > > 
> > > > > {
> > > > > 0:r2=x;
> > > > > 1:r2=x; 1:r5=z;
> > > > > 2:r2=z; 2:r4=x;
> > > > > }
> > > > >  P0           | P1            | P2           ;
> > > > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > > > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > > > >               | beq LC00      | beq  LC01    ;
> > > > >               | LC00:         | LC01:        ;
> > > > >               | isync         | isync        ;
> > > > >               | li r4,1       | lwz r3,0(r4) ;
> > > > >               | stw r4,0(r5)  |              ;
> > > > > exists
> > > > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > > > 
> > > > > 
> > > > > Which appears to be allowed. I don't think you need to worry about backwards
> > > > > branches for the ctrl+isync construction (none of the current examples do,
> > > > > afaict).
> > > > > 
> > > > 
> > > > Yes.. my concern about backwards branches is not quite related to the topic; I
> > > > cared about that mostly because my test is using atomic operations, and I
> > > > just want to test the exact asm code.
> > > > 
> > > > > Anyway, all the problematic cases seem to arise when we start mixing
> > > > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > > > one group reads from an access in the other group). It would be simplest
> > > > > to say that this doesn't provide any transitivity guarantees, and that
> > > > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > > > 
> > > > 
> > > > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
> > >           ^^^^^^^
> > > This should be ACQUIRE...
> > > 
> > > > guaranteed only if an ACQUIRE reads from a RELEASE. That's exactly the
> > > > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > > > heading to. So I think we are fine to use ctrl+isync here, right?
> > 
> > We are going to have to err on the side of strictness, that is, having
> > the documentation place more requirements on the developer than the union
> > of the hardware does.  Besides, I haven't heard any recent complaints
> > that memory-barriers.txt is too simple.  ;-)
> 
> Agreed ;-)
> 
> For atomic operations, using isync in ACQUIRE operations does guarantee
> that a pure RELEASE/ACQUIRE chain provides transitivity. So, again, I
> think we are fine here to use isync in ACQUIRE atomic operations,
> unless you think we need to be more strict, i.e., making ACQUIRE itself
> provide transitivity?

As I understand it, either isync or lwsync suffices, with the choice
depending on the hardware.  The kernel will rewrite itself at boot time
if you use the appropriate macro.  ;-)
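
For the record, the boot-time rewriting amounts to something like the
following sketch (hypothetical code, loosely modeled on the kernel's
lwsync fixup machinery; struct barrier_fixup is made up for
illustration):

	struct barrier_fixup {
		unsigned int *site;	/* address of the emitted isync/sync */
	};

	static void patch_barrier_sites(struct barrier_fixup *start,
					struct barrier_fixup *end)
	{
		struct barrier_fixup *f;

		/* Keep the conservative instruction on CPUs without lwsync. */
		if (!cpu_has_feature(CPU_FTR_LWSYNC))
			return;
		/* Otherwise rewrite each recorded barrier site to lwsync. */
		for (f = start; f < end; f++)
			patch_instruction(f->site, PPC_INST_LWSYNC);
	}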

							Thanx, Paul
Boqun Feng Sept. 26, 2015, 2:18 a.m. UTC | #10
On Fri, Sep 25, 2015 at 02:29:04PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 23, 2015 at 08:07:55AM +0800, Boqun Feng wrote:
> > On Tue, Sep 22, 2015 at 08:25:40AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> > > > On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > > > > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > > > > Hi Boqun,
> > > > > > 
> > > > > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > > > > using "lwsync" and "isync".
> > > > > > > > > 
> > > > > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > > > > 
> > > > > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > > > > 
> > > > > > The litmus test I'm thinking of is:
> > > > > > 
> > > > > > 
> > > > > > {
> > > > > > 0:r2=x;
> > > > > > 1:r2=x; 1:r5=z;
> > > > > > 2:r2=z; 2:r4=x;
> > > > > > }
> > > > > >  P0           | P1            | P2           ;
> > > > > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > > > > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > > > > >               | beq LC00      | beq  LC01    ;
> > > > > >               | LC00:         | LC01:        ;
> > > > > >               | isync         | isync        ;
> > > > > >               | li r4,1       | lwz r3,0(r4) ;
> > > > > >               | stw r4,0(r5)  |              ;
> > > > > > exists
> > > > > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > > > > 
> > > > > > 
> > > > > > Which appears to be allowed. I don't think you need to worry about backwards
> > > > > > branches for the ctrl+isync construction (none of the current examples do,
> > > > > > afaict).
> > > > > > 
> > > > > 
> > > > > Yes.. my concern about backwards branches is not quite related to the topic; I
> > > > > cared about that mostly because my test is using atomic operations, and I
> > > > > just want to test the exact asm code.
> > > > > 
> > > > > > Anyway, all the problematic cases seem to arise when we start mixing
> > > > > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > > > > one group reads from an access in the other group). It would be simplest
> > > > > > to say that this doesn't provide any transitivity guarantees, and that
> > > > > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > > > > 
> > > > > 
> > > > > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
> > > >           ^^^^^^^
> > > > This should be ACQUIRE...
> > > > 
> > > > > guaranteed only if an ACQUIRE reads from a RELEASE. That's exactly the
> > > > > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > > > > heading to. So I think we are fine to use ctrl+isync here, right?
> > > 
> > > We are going to have to err on the side of strictness, that is, having
> > > the documentation place more requirements on the developer than the union
> > > of the hardware does.  Besides, I haven't heard any recent complaints
> > > that memory-barriers.txt is too simple.  ;-)
> > 
> > Agreed ;-)
> > 
> > For atomic operations, using isync in ACQUIRE operations does guarantee
> > that a pure RELEASE/ACQUIRE chain provides transitivity. So, again, I
> > think we are fine here to use isync in ACQUIRE atomic operations,
> > unless you think we need to be more strict, i.e., making ACQUIRE itself
> > provide transitivity?
> 
> As I understand it, either isync or lwsync suffices, with the choice
> depending on the hardware.  The kernel will rewrite itself at boot time
> if you use the appropriate macro.  ;-)
> 

Yep ;-)

Thank you and Will both for your comments. To be honest, I only began to
learn about these transitivity and cumulativity things recently. It's
been a lot of fun to discuss them with you, Peter and Will, and the
herdtools you suggested and the N2745 document you wrote really helped a
lot. Thank you again!

Regards,
Boqun

Patch

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 55f106e..d1dcdcf 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,39 @@ 
 
 #define ATOMIC_INIT(i)		{ (i) }
 
+/*
+ * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, an isync is enough as an acquire
+ * barrier on platforms without lwsync.
+ */
+#ifdef CONFIG_SMP
+#define smp_acquire_barrier__after_atomic() \
+	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
+#else
+#define smp_acquire_barrier__after_atomic() barrier()
+#endif
+#define __atomic_op_acquire(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
+	smp_acquire_barrier__after_atomic();				\
+	__ret;								\
+})
+
+#define __atomic_op_release(op, args...)				\
+({									\
+	smp_lwsync();							\
+	op##_relaxed(args);						\
+})
+
+#define __atomic_op_fence(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret;				\
+	smp_lwsync();							\
+	__ret = op##_relaxed(args);					\
+	smp_mb__after_atomic();						\
+	__ret;								\
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
 	int t;
@@ -42,27 +75,27 @@  static __inline__ void atomic_##op(int a, atomic_t *v)			\
 	: "cc");							\
 }									\
 
-#define ATOMIC_OP_RETURN(op, asm_op)					\
-static __inline__ int atomic_##op##_return(int a, atomic_t *v)		\
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)				\
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v)	\
 {									\
 	int t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	lwarx	%0,0,%2		# atomic_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-	PPC405_ERR77(0,%2)						\
-"	stwcx.	%0,0,%2 \n"						\
+"1:	lwarx	%0,0,%3		# atomic_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+	PPC405_ERR77(0, %3)						\
+"	stwcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, asm_op)
+#define ATOMIC_OPS(op, asm_op)						\
+	ATOMIC_OP(op, asm_op)						\
+	ATOMIC_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC_OPS(add, add)
 ATOMIC_OPS(sub, subf)
@@ -71,8 +104,11 @@  ATOMIC_OP(and, and)
 ATOMIC_OP(or, or)
 ATOMIC_OP(xor, xor)
 
+#define atomic_add_return_relaxed atomic_add_return_relaxed
+#define atomic_sub_return_relaxed atomic_sub_return_relaxed
+
 #undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
@@ -285,26 +321,27 @@  static __inline__ void atomic64_##op(long a, atomic64_t *v)		\
 	: "cc");							\
 }
 
-#define ATOMIC64_OP_RETURN(op, asm_op)					\
-static __inline__ long atomic64_##op##_return(long a, atomic64_t *v)	\
+#define ATOMIC64_OP_RETURN_RELAXED(op, asm_op)				\
+static inline long							\
+atomic64_##op##_return_relaxed(long a, atomic64_t *v)			\
 {									\
 	long t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	ldarx	%0,0,%2		# atomic64_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-"	stdcx.	%0,0,%2 \n"						\
+"1:	ldarx	%0,0,%3		# atomic64_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+"	stdcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC64_OPS(op, asm_op) ATOMIC64_OP(op, asm_op) ATOMIC64_OP_RETURN(op, asm_op)
+#define ATOMIC64_OPS(op, asm_op)					\
+	ATOMIC64_OP(op, asm_op)						\
+	ATOMIC64_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC64_OPS(add, add)
 ATOMIC64_OPS(sub, subf)
@@ -312,8 +349,11 @@  ATOMIC64_OP(and, and)
 ATOMIC64_OP(or, or)
 ATOMIC64_OP(xor, xor)
 
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
+#define atomic64_add_return_relaxed atomic64_add_return_relaxed
+#define atomic64_sub_return_relaxed atomic64_sub_return_relaxed
+
+#undef ATOMIC64_OPS
+#undef ATOMIC64_OP_RETURN_RELAXED
 #undef ATOMIC64_OP
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)