
[v2] arch/powerpc/include/asm/barrier.h: redefine rmb and wmb to lwsync

Message ID 20230222090344.189270-1-kconsul@linux.vnet.ibm.com (mailing list archive)
State Rejected, archived
Series [v2] arch/powerpc/include/asm/barrier.h: redefine rmb and wmb to lwsync

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 24 jobs.

Commit Message

Kautuk Consul Feb. 22, 2023, 9:03 a.m. UTC
A link from ibm.com states:
"Ensures that all instructions preceding the call to __lwsync
 complete before any subsequent store instructions can be executed
 on the processor that executed the function. Also, it ensures that
 all load instructions preceding the call to __lwsync complete before
 any subsequent load instructions can be executed on the processor
 that executed the function. This allows you to synchronize between
 multiple processors with minimal performance impact, as __lwsync
 does not wait for confirmation from each processor."

That's why smp_rmb() and smp_wmb() are defined to lwsync.
But this same understanding applies to parallel pipeline
execution on each PowerPC processor.
So, use the lwsync instruction for rmb() and wmb() on the PPC
architectures that support it.

Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/barrier.h | 7 +++++++
 1 file changed, 7 insertions(+)
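
As a minimal sketch of the ordering these barriers provide (illustrative code
only, not part of the patch; the names are made up), consider the usual
producer/consumer pattern.  With the change above, both barriers would emit
lwsync instead of sync on CONFIG_PPC64 and CONFIG_PPC_E500MC:

static int data;
static int flag;

void producer(void)
{
	data = 42;		/* store A */
	wmb();			/* keep store A ahead of store B */
	WRITE_ONCE(flag, 1);	/* store B */
}

void consumer(void)
{
	while (!READ_ONCE(flag))	/* load B */
		cpu_relax();
	rmb();				/* keep load B ahead of load A */
	BUG_ON(data != 42);		/* load A */
}

For pure CPU-to-CPU ordering of cacheable memory the smp_ variants would
normally be used; the non-smp_ rmb()/wmb() also have to cover device (IO)
accesses, which is what the discussion below turns on.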

Comments

Kautuk Consul Feb. 22, 2023, 9:30 a.m. UTC | #1
Again, could some IBM/non-IBM employees do basic sanity kernel load
testing on PPC64 UP and SMP systems for this patch?
I would deeply appreciate it! :-)

Thanks again!
Christophe Leroy Feb. 22, 2023, 9:43 a.m. UTC | #2
Le 22/02/2023 à 10:03, Kautuk Consul a écrit :
> A link from ibm.com states:
> "Ensures that all instructions preceding the call to __lwsync
>   complete before any subsequent store instructions can be executed
>   on the processor that executed the function. Also, it ensures that
>   all load instructions preceding the call to __lwsync complete before
>   any subsequent load instructions can be executed on the processor
>   that executed the function. This allows you to synchronize between
>   multiple processors with minimal performance impact, as __lwsync
>   does not wait for confirmation from each processor."
> 
> Thats why smp_rmb() and smp_wmb() are defined to lwsync.
> But this same understanding applies to parallel pipeline
> execution on each PowerPC processor.
> So, use the lwsync instruction for rmb() and wmb() on the PPC
> architectures that support it.
> 
> Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

> ---
>   arch/powerpc/include/asm/barrier.h | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index b95b666f0374..e088dacc0ee8 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,8 +36,15 @@
>    * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
>    */
>   #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> +
> +/* The sub-arch has lwsync. */
> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")

I'd have preferred with 'asm volatile' though.

> +#else
>   #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>   #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> +#endif
>   
>   /* The sub-arch has lwsync */
>   #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
Christophe Leroy Feb. 22, 2023, 9:44 a.m. UTC | #3
Le 22/02/2023 à 10:30, Kautuk Consul a écrit :
> Again, could some IBM/non-IBM employees do basic sanity kernel load
> testing on PPC64 UP and SMP systems for this patch?
> would deeply appreciate it! :-)

And can 'non-IBM' 'non employees' do something ? :)

> 
> Thanks again!
> 

Did you try on QEMU ?
Kautuk Consul Feb. 22, 2023, 9:46 a.m. UTC | #4
> 
> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Thanks!
> 
> > ---
> >   arch/powerpc/include/asm/barrier.h | 7 +++++++
> >   1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > index b95b666f0374..e088dacc0ee8 100644
> > --- a/arch/powerpc/include/asm/barrier.h
> > +++ b/arch/powerpc/include/asm/barrier.h
> > @@ -36,8 +36,15 @@
> >    * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> >    */
> >   #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> > +
> > +/* The sub-arch has lwsync. */
> > +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> 
> I'd have preferred with 'asm volatile' though.
Sorry about that! That wasn't the intent of this patch.
Probably another patch series should change this manner of #defining
assembly.
Kautuk Consul Feb. 22, 2023, 9:49 a.m. UTC | #5
On Wed, Feb 22, 2023 at 09:44:54AM +0000, Christophe Leroy wrote:
> 
> 
> Le 22/02/2023 à 10:30, Kautuk Consul a écrit :
> > Again, could some IBM/non-IBM employees do basic sanity kernel load
> > testing on PPC64 UP and SMP systems for this patch?
> > would deeply appreciate it! :-)
> 
> And can 'non-IBM' 'non employees' do something ? :)
Anyone can help! :-)
> 
> > 
> > Thanks again!
> > 
> 
> Did you try on QEMU ?
I am a new IBM employee, so I don't have a setup with a cross-compiler,
rootfs, etc. at the moment.
That's why I requested this from the mailing list.
Christophe Leroy Feb. 22, 2023, 9:51 a.m. UTC | #6
Le 22/02/2023 à 10:46, Kautuk Consul a écrit :
>>
>> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> Thanks!
>>
>>> ---
>>>    arch/powerpc/include/asm/barrier.h | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>>> index b95b666f0374..e088dacc0ee8 100644
>>> --- a/arch/powerpc/include/asm/barrier.h
>>> +++ b/arch/powerpc/include/asm/barrier.h
>>> @@ -36,8 +36,15 @@
>>>     * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
>>>     */
>>>    #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
>>> +
>>> +/* The sub-arch has lwsync. */
>>> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
>>> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
>>> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
>>
>> I'd have preferred with 'asm volatile' though.
> Sorry about that! That wasn't the intent of this patch.
> Probably another patch series should change this manner of #defining
> assembly.

Why add new lines that are wrong, and then need another patch to make them 
right?

When you build a new house in an old village, do you first build your house 
with old materials and then replace everything with new materials?
Kautuk Consul Feb. 22, 2023, 9:55 a.m. UTC | #7
> >> I'd have preferred with 'asm volatile' though.
> > Sorry about that! That wasn't the intent of this patch.
> > Probably another patch series should change this manner of #defining
> > assembly.
> 
> Why add new lines that are wrong, and then need another patch to make them 
> right?
> 
> When you build a new house in an old village, do you first build your house 
> with old materials and then replace everything with new materials?
Sorry Christophe. I will take care next time to adhere to new
conventions.
Paul E. McKenney Feb. 22, 2023, 5:47 p.m. UTC | #8
On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> A link from ibm.com states:
> "Ensures that all instructions preceding the call to __lwsync
>  complete before any subsequent store instructions can be executed
>  on the processor that executed the function. Also, it ensures that
>  all load instructions preceding the call to __lwsync complete before
>  any subsequent load instructions can be executed on the processor
>  that executed the function. This allows you to synchronize between
>  multiple processors with minimal performance impact, as __lwsync
>  does not wait for confirmation from each processor."
> 
> Thats why smp_rmb() and smp_wmb() are defined to lwsync.
> But this same understanding applies to parallel pipeline
> execution on each PowerPC processor.
> So, use the lwsync instruction for rmb() and wmb() on the PPC
> architectures that support it.
> 
> Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/barrier.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index b95b666f0374..e088dacc0ee8 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,8 +36,15 @@
>   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
>   */
>  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> +
> +/* The sub-arch has lwsync. */
> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")

Hmmm...

Does the lwsync instruction now order both cached and uncached accesses?
Or have there been changes so that smp_rmb() and smp_wmb() get this
definition, while rmb() and wmb() still get the sync instruction?
(Not seeing this, but I could easily be missing something.)

							Thanx, Paul

> +#else
>  #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>  #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> +#endif
>  
>  /* The sub-arch has lwsync */
>  #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> -- 
> 2.31.1
>
Michael Ellerman Feb. 23, 2023, 3:51 a.m. UTC | #9
Hi Paul,

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
>> A link from ibm.com states:
>> "Ensures that all instructions preceding the call to __lwsync
>>  complete before any subsequent store instructions can be executed
>>  on the processor that executed the function. Also, it ensures that
>>  all load instructions preceding the call to __lwsync complete before
>>  any subsequent load instructions can be executed on the processor
>>  that executed the function. This allows you to synchronize between
>>  multiple processors with minimal performance impact, as __lwsync
>>  does not wait for confirmation from each processor."
>> 
>> Thats why smp_rmb() and smp_wmb() are defined to lwsync.
>> But this same understanding applies to parallel pipeline
>> execution on each PowerPC processor.
>> So, use the lwsync instruction for rmb() and wmb() on the PPC
>> architectures that support it.
>> 
>> Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/barrier.h | 7 +++++++
>>  1 file changed, 7 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>> index b95b666f0374..e088dacc0ee8 100644
>> --- a/arch/powerpc/include/asm/barrier.h
>> +++ b/arch/powerpc/include/asm/barrier.h
>> @@ -36,8 +36,15 @@
>>   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
>>   */
>>  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
>> +
>> +/* The sub-arch has lwsync. */
>> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
>> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
>> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
>
> Hmmm...
>
> Does the lwsync instruction now order both cached and uncached accesses?

No.

> Or have there been changes so that smp_rmb() and smp_wmb() get this
> definition, while rmb() and wmb() still get the sync instruction?

No.

They come from:

include/asm-generic/barrier.h:#define rmb()        do { kcsan_rmb(); __rmb(); } while (0)
include/asm-generic/barrier.h:#define wmb()        do { kcsan_wmb(); __wmb(); } while (0)
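
i.e. on CONFIG_PPC64 the non-SMP barriers today expand to roughly (sketch
only, ignoring what the KCSAN hooks themselves expand to):

rmb();	/* -> do { kcsan_rmb(); __asm__ __volatile__ ("sync" : : : "memory"); } while (0) */
wmb();	/* -> do { kcsan_wmb(); __asm__ __volatile__ ("sync" : : : "memory"); } while (0) */

and with this patch applied they would emit lwsync instead.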

> (Not seeing this, but I could easily be missing something.)

You are correct, the patch is wrong because it fails to account for IO
accesses.

Kautuk, I'm not sure what motivated you to look at these barriers, was
it just the documentation you linked to?

Maybe we need some better documentation in the kernel explaining why we
use the barriers we do?

There's already a pretty decent comment above the definitions, but maybe
it needs further clarification:

  /*
   * Memory barrier.
   * The sync instruction guarantees that all memory accesses initiated
   * by this processor have been performed (with respect to all other
   * mechanisms that access memory).  The eieio instruction is a barrier
   * providing an ordering (separately) for (a) cacheable stores and (b)
   * loads and stores to non-cacheable memory (e.g. I/O devices).
   *
   * mb() prevents loads and stores being reordered across this point.
   * rmb() prevents loads being reordered across this point.
   * wmb() prevents stores being reordered across this point.
   *
   * *mb() variants without smp_ prefix must order all types of memory
   * operations with one another. sync is the only instruction sufficient
   * to do this.
   *
   * For the smp_ barriers, ordering is for cacheable memory operations
   * only. We have to use the sync instruction for smp_mb(), since lwsync
   * doesn't order loads with respect to previous stores.  Lwsync can be
   * used for smp_rmb() and smp_wmb().
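
For example (a hypothetical driver-style sketch, not taken from the kernel;
the struct, function and register names are made up, and a real driver would
normally use writel()/out_be32(), which carry their own ordering), the
non-smp_ wmb() has to order a cacheable descriptor store against a Caching
Inhibited MMIO store:

#include <linux/io.h>
#include <linux/types.h>

#define DESC_VALID	0x1			/* made-up flag */

struct ring_desc {
	u64 addr;
	u32 len;
	u32 flags;
};

static void ring_doorbell(struct ring_desc *desc, void __iomem *doorbell,
			  u64 dma_addr, u32 len)
{
	/* Cacheable stores: fill in the DMA descriptor. */
	desc->addr  = dma_addr;
	desc->len   = len;
	desc->flags = DESC_VALID;

	/*
	 * The device must not observe the doorbell write before the
	 * descriptor stores.  The doorbell is Caching Inhibited (MMIO),
	 * so lwsync would not order it; wmb() has to be sync here.
	 */
	wmb();

	/* Non-cacheable store: MMIO doorbell (__raw_writel has no barrier). */
	__raw_writel(1, doorbell);
}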


cheers
Kautuk Consul Feb. 23, 2023, 4:01 a.m. UTC | #10
On 2023-02-22 09:47:19, Paul E. McKenney wrote:
> On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> > A link from ibm.com states:
> > "Ensures that all instructions preceding the call to __lwsync
> >  complete before any subsequent store instructions can be executed
> >  on the processor that executed the function. Also, it ensures that
> >  all load instructions preceding the call to __lwsync complete before
> >  any subsequent load instructions can be executed on the processor
> >  that executed the function. This allows you to synchronize between
> >  multiple processors with minimal performance impact, as __lwsync
> >  does not wait for confirmation from each processor."
> > 
> > Thats why smp_rmb() and smp_wmb() are defined to lwsync.
> > But this same understanding applies to parallel pipeline
> > execution on each PowerPC processor.
> > So, use the lwsync instruction for rmb() and wmb() on the PPC
> > architectures that support it.
> > 
> > Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/include/asm/barrier.h | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > index b95b666f0374..e088dacc0ee8 100644
> > --- a/arch/powerpc/include/asm/barrier.h
> > +++ b/arch/powerpc/include/asm/barrier.h
> > @@ -36,8 +36,15 @@
> >   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> >   */
> >  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> > +
> > +/* The sub-arch has lwsync. */
> > +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> 
> Hmmm...
> 
> Does the lwsync instruction now order both cached and uncached accesses?
> Or have there been changes so that smp_rmb() and smp_wmb() get this
> definition, while rmb() and wmb() still get the sync instruction?
> (Not seeing this, but I could easily be missing something.)
> 
> 							Thanx, Paul
Upfront I don't see any documentation that states that lwsync
distinguishes between cached and uncached accesses.
That's why I requested the mailing list for test results with
kernel load testing.
> 
> > +#else
> >  #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >  #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > +#endif
> >  
> >  /* The sub-arch has lwsync */
> >  #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > -- 
> > 2.31.1
> >
Kautuk Consul Feb. 23, 2023, 4:07 a.m. UTC | #11
On 2023-02-23 14:51:25, Michael Ellerman wrote:
> Hi Paul,
> 
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> >> A link from ibm.com states:
> >> "Ensures that all instructions preceding the call to __lwsync
> >>  complete before any subsequent store instructions can be executed
> >>  on the processor that executed the function. Also, it ensures that
> >>  all load instructions preceding the call to __lwsync complete before
> >>  any subsequent load instructions can be executed on the processor
> >>  that executed the function. This allows you to synchronize between
> >>  multiple processors with minimal performance impact, as __lwsync
> >>  does not wait for confirmation from each processor."
> >> 
> >> Thats why smp_rmb() and smp_wmb() are defined to lwsync.
> >> But this same understanding applies to parallel pipeline
> >> execution on each PowerPC processor.
> >> So, use the lwsync instruction for rmb() and wmb() on the PPC
> >> architectures that support it.
> >> 
> >> Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/include/asm/barrier.h | 7 +++++++
> >>  1 file changed, 7 insertions(+)
> >> 
> >> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> >> index b95b666f0374..e088dacc0ee8 100644
> >> --- a/arch/powerpc/include/asm/barrier.h
> >> +++ b/arch/powerpc/include/asm/barrier.h
> >> @@ -36,8 +36,15 @@
> >>   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> >>   */
> >>  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> >> +
> >> +/* The sub-arch has lwsync. */
> >> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> >> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> >> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> >
> > Hmmm...
> >
> > Does the lwsync instruction now order both cached and uncached accesses?
> 
> No.
> 
> > Or have there been changes so that smp_rmb() and smp_wmb() get this
> > definition, while rmb() and wmb() still get the sync instruction?
> 
> No.
> 
> They come from:
> 
> include/asm-generic/barrier.h:#define rmb()        do { kcsan_rmb(); __rmb(); } while (0)
> include/asm-generic/barrier.h:#define wmb()        do { kcsan_wmb(); __wmb(); } while (0)
> 
> > (Not seeing this, but I could easily be missing something.)
> 
> You are correct, the patch is wrong because it fails to account for IO
> accesses.
> 
> Kautuk, I'm not sure what motivated you to look at these barriers, was
> it just the documentation you linked to?
> 
> Maybe we need some better documentation in the kernel explaining why we
> use the barriers we do?
> 
> Although there's already a pretty decent comment above the definitions,
> but maybe it needs further clarification:
> 
>   /*
>    * Memory barrier.
>    * The sync instruction guarantees that all memory accesses initiated
>    * by this processor have been performed (with respect to all other
>    * mechanisms that access memory).  The eieio instruction is a barrier
>    * providing an ordering (separately) for (a) cacheable stores and (b)
>    * loads and stores to non-cacheable memory (e.g. I/O devices).
>    *
>    * mb() prevents loads and stores being reordered across this point.
>    * rmb() prevents loads being reordered across this point.
>    * wmb() prevents stores being reordered across this point.
>    *
>    * *mb() variants without smp_ prefix must order all types of memory
>    * operations with one another. sync is the only instruction sufficient
>    * to do this.
>    *
>    * For the smp_ barriers, ordering is for cacheable memory operations
>    * only. We have to use the sync instruction for smp_mb(), since lwsync
>    * doesn't order loads with respect to previous stores.  Lwsync can be
>    * used for smp_rmb() and smp_wmb().

Sorry I didn't change the comment.
My point is: Is the "sync is the only instruction sufficient to do this"
comment completely correct?
As I mentioned in my reply to Paul, I didn't find any documentation that
states up front (in the differences between sync and lwsync) that lwsync
distinguishes between cached and uncached accesses.
> 
> 
> cheers
Paul E. McKenney Feb. 23, 2023, 4:16 a.m. UTC | #12
On Thu, Feb 23, 2023 at 09:31:48AM +0530, Kautuk Consul wrote:
> On 2023-02-22 09:47:19, Paul E. McKenney wrote:
> > On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> > > A link from ibm.com states:
> > > "Ensures that all instructions preceding the call to __lwsync
> > >  complete before any subsequent store instructions can be executed
> > >  on the processor that executed the function. Also, it ensures that
> > >  all load instructions preceding the call to __lwsync complete before
> > >  any subsequent load instructions can be executed on the processor
> > >  that executed the function. This allows you to synchronize between
> > >  multiple processors with minimal performance impact, as __lwsync
> > >  does not wait for confirmation from each processor."
> > > 
> > > Thats why smp_rmb() and smp_wmb() are defined to lwsync.
> > > But this same understanding applies to parallel pipeline
> > > execution on each PowerPC processor.
> > > So, use the lwsync instruction for rmb() and wmb() on the PPC
> > > architectures that support it.
> > > 
> > > Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/include/asm/barrier.h | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > index b95b666f0374..e088dacc0ee8 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -36,8 +36,15 @@
> > >   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> > >   */
> > >  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> > > +
> > > +/* The sub-arch has lwsync. */
> > > +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > > +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > > +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > 
> > Hmmm...
> > 
> > Does the lwsync instruction now order both cached and uncached accesses?
> > Or have there been changes so that smp_rmb() and smp_wmb() get this
> > definition, while rmb() and wmb() still get the sync instruction?
> > (Not seeing this, but I could easily be missing something.)

> Upfront I don't see any documentation that states that lwsync
> distinguishes between cached and uncached accesses.
> That's why I requested the mailing list for test results with
> kernel load testing.

I suggest giving the reference manual a very careful read.  I wish I
could be more helpful, but I found that a very long time ago, and no
longer recall exactly where it was stated.

But maybe Michael Ellerman has a pointer?

							Thanx, Paul

> > > +#else
> > >  #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > >  #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > +#endif
> > >  
> > >  /* The sub-arch has lwsync */
> > >  #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > > -- 
> > > 2.31.1
> > >
Kautuk Consul Feb. 23, 2023, 4:22 a.m. UTC | #13
On 2023-02-22 20:16:10, Paul E. McKenney wrote:
> On Thu, Feb 23, 2023 at 09:31:48AM +0530, Kautuk Consul wrote:
> > On 2023-02-22 09:47:19, Paul E. McKenney wrote:
> > > On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> > > > A link from ibm.com states:
> > > > "Ensures that all instructions preceding the call to __lwsync
> > > >  complete before any subsequent store instructions can be executed
> > > >  on the processor that executed the function. Also, it ensures that
> > > >  all load instructions preceding the call to __lwsync complete before
> > > >  any subsequent load instructions can be executed on the processor
> > > >  that executed the function. This allows you to synchronize between
> > > >  multiple processors with minimal performance impact, as __lwsync
> > > >  does not wait for confirmation from each processor."
> > > > 
> > > > Thats why smp_rmb() and smp_wmb() are defined to lwsync.
> > > > But this same understanding applies to parallel pipeline
> > > > execution on each PowerPC processor.
> > > > So, use the lwsync instruction for rmb() and wmb() on the PPC
> > > > architectures that support it.
> > > > 
> > > > Signed-off-by: Kautuk Consul <kconsul@linux.vnet.ibm.com>
> > > > ---
> > > >  arch/powerpc/include/asm/barrier.h | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > > 
> > > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > > index b95b666f0374..e088dacc0ee8 100644
> > > > --- a/arch/powerpc/include/asm/barrier.h
> > > > +++ b/arch/powerpc/include/asm/barrier.h
> > > > @@ -36,8 +36,15 @@
> > > >   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> > > >   */
> > > >  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> > > > +
> > > > +/* The sub-arch has lwsync. */
> > > > +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > > > +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > > > +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> > > 
> > > Hmmm...
> > > 
> > > Does the lwsync instruction now order both cached and uncached accesses?
> > > Or have there been changes so that smp_rmb() and smp_wmb() get this
> > > definition, while rmb() and wmb() still get the sync instruction?
> > > (Not seeing this, but I could easily be missing something.)
> 
> > Upfront I don't see any documentation that states that lwsync
> > distinguishes between cached and uncached accesses.
> > That's why I requested the mailing list for test results with
> > kernel load testing.
> 
> I suggest giving the reference manual a very careful read.  I wish I
> could be more helpful, but I found that a very long time ago, and no
> longer recall exactly where it was stated.
Will do that as soon as I get an opportunity.
> 
> But maybe Michael Ellerman has a pointer?
Sure. Maybe the cached and uncached accesses in these instructions should be
spelt out more clearly for newer people like me. :-)
Thanks for your time, Paul.
> 
> 							Thanx, Paul
> 
> > > > +#else
> > > >  #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > >  #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > > +#endif
> > > >  
> > > >  /* The sub-arch has lwsync */
> > > >  #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> > > > -- 
> > > > 2.31.1
> > > >
Kautuk Consul Feb. 23, 2023, 4:43 a.m. UTC | #14
> You are correct, the patch is wrong because it fails to account for IO
> accesses.

Okay, I looked at the PowerPC ISA and found:
"The memory barrier provides an ordering function for the storage accesses
caused by Load, Store,and dcbz instructions that are executed by the processor
executing the sync instruction and for which the specified storage location
is in storage that is Memory Coherence Required and is neitherWrite Through
Required nor Caching Inhibited.The applicable pairs are all pairs ai ,bj of
such accesses except those in which ai is an accesscaused by a Store or dcbz
instruction and bj is anaccess caused by a Load instruction."
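
The exclusion of store-then-load pairs in that passage also lines up with the
existing comment's point that lwsync cannot be used for smp_mb().  A
hypothetical sketch (names made up) of the store-buffering case that only a
full sync forbids:

static int x, y;

/* Runs on CPU 0 */
int sb_cpu0(void)
{
	WRITE_ONCE(x, 1);
	smp_mb();		/* sync: orders the store against the later load */
	return READ_ONCE(y);
}

/* Runs on CPU 1 */
int sb_cpu1(void)
{
	WRITE_ONCE(y, 1);
	smp_mb();
	return READ_ONCE(x);
}

/*
 * With sync, at least one of the two functions must return 1.  With only
 * lwsync, both could return 0, because lwsync does not order a store
 * against a later load.
 */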

Thanks for your time, Michael. Sorry for the noise.
> 
> Kautuk, I'm not sure what motivated you to look at these barriers, was
> it just the documentation you linked to?
I read the basic documentation. Now that I have access to the PowerISA
document I guess I'll go through it more thoroughly.
>
Michael Ellerman Feb. 23, 2023, 9:24 a.m. UTC | #15
Kautuk Consul <kconsul@linux.vnet.ibm.com> writes:
>  
>> You are correct, the patch is wrong because it fails to account for IO
>> accesses.
>
> Okay, I looked at the PowerPC ISA and found:
> "The memory barrier provides an ordering function for the storage accesses
> caused by Load, Store, and dcbz instructions that are executed by the processor
> executing the sync instruction and for which the specified storage location
> is in storage that is Memory Coherence Required and is neither Write Through
> Required nor Caching Inhibited.

Yep, that's the key sentence there. If you look at the definition for
"sync", it has no exceptions for different storage types.

I agree it's not very clear unless you're looking closely.

> Thanks for your time, Michael. Sorry for the noise.
>> 
>> Kautuk, I'm not sure what motivated you to look at these barriers, was
>> it just the documentation you linked to?

> I read the basic documentation. Now that I have access to the PowerISA
> document I guess I'll go through it more thoroughly.

The ISA is available publicly. There are links to most versions here:

https://wiki.raptorcs.com/wiki/Power_ISA

cheers

Patch

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index b95b666f0374..e088dacc0ee8 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -36,8 +36,15 @@ 
  * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
  */
 #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
+
+/* The sub-arch has lwsync. */
+#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
+#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
+#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
+#else
 #define __rmb()  __asm__ __volatile__ ("sync" : : : "memory")
 #define __wmb()  __asm__ __volatile__ ("sync" : : : "memory")
+#endif
 
 /* The sub-arch has lwsync */
 #if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)