diff mbox series

[1/3] x86: Set preferred CPU features on the KH-40000 and KX-7000 Zhaoxin processors

Message ID 20240626024649.3689-1-MayShao-oc@zhaoxin.com
State New
Series [1/3] x86: Set preferred CPU features on the KH-40000 and KX-7000 Zhaoxin processors

Commit Message

Mayshao-oc June 26, 2024, 2:46 a.m. UTC
From: MayShao <mayshao-oc@zhaoxin.com>

Fix code indentation issues in the Zhaoxin branch.

Unaligned AVX loads are slower on KH-40000 and KX-7000, so disable
AVX_Fast_Unaligned_Load.

Enable the Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use the sse2_unaligned versions of memset, strcpy, and strcat.
---
 sysdeps/x86/cpu-features.c | 66 ++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 27 deletions(-)

Comments

Xi Ruoyao June 26, 2024, 2:56 a.m. UTC | #1
On Wed, 2024-06-26 at 10:46 +0800, MayShao wrote:
> From: MayShao <mayshao-oc@zhaoxin.com>
> 
> Fix code indentation issues under the Zhaoxin branch.
> 
> Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
> the AVX_Fast_Unaligned_Load.
> 
> Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
> use sse2_unaligned version of memset,strcpy and strcat.
> ---
>  sysdeps/x86/cpu-features.c | 66 ++++++++++++++++++++++---------------
> -
>  1 file changed, 39 insertions(+), 27 deletions(-)
> 
> diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> index 3d7c2819d7..24fbf699b9 100644
> --- a/sysdeps/x86/cpu-features.c
> +++ b/sysdeps/x86/cpu-features.c
> @@ -1015,7 +1015,7 @@
> https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
>        kind = arch_kind_zhaoxin;
>  
>        get_common_indices (cpu_features, &family, &model,
> &extended_model,
> -			  &stepping);
> +            &stepping);

Don't randomly change tabs to spaces.
Noah Goldstein June 26, 2024, 3:26 a.m. UTC | #2
On Wed, Jun 26, 2024 at 10:56 AM Xi Ruoyao <xry111@xry111.site> wrote:
>
> On Wed, 2024-06-26 at 10:46 +0800, MayShao wrote:
> > From: MayShao <mayshao-oc@zhaoxin.com>
> >
> > Fix code indentation issues under the Zhaoxin branch.
> >
> > Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
> > the AVX_Fast_Unaligned_Load.
> >
> > Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
> > use sse2_unaligned version of memset,strcpy and strcat.
> > ---
> >  sysdeps/x86/cpu-features.c | 66 ++++++++++++++++++++++---------------
> > -
> >  1 file changed, 39 insertions(+), 27 deletions(-)
> >
> > diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> > index 3d7c2819d7..24fbf699b9 100644
> > --- a/sysdeps/x86/cpu-features.c
> > +++ b/sysdeps/x86/cpu-features.c
> > @@ -1015,7 +1015,7 @@
> > https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
> >        kind = arch_kind_zhaoxin;
> >
> >        get_common_indices (cpu_features, &family, &model,
> > &extended_model,
> > -                       &stepping);
> > +            &stepping);
>
> Don't randomly change tabs to spaces.

We have a clang-format file; you should run it over the entire patch.

Also, can you comment or enum the different models?
>
> --
> Xi Ruoyao <xry111@xry111.site>
> School of Aerospace Science and Technology, Xidian University
Mayshao-oc June 26, 2024, 5:12 a.m. UTC | #3
On Wed, Jun 26, 2024 at 11:26 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, Jun 26, 2024 at 10:56 AM Xi Ruoyao <xry111@xry111.site> wrote:
> >
> > On Wed, 2024-06-26 at 10:46 +0800, MayShao wrote:
> > > From: MayShao <mayshao-oc@zhaoxin.com>
> > >
> > > Fix code indentation issues under the Zhaoxin branch.
> > >
> > > Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
> > > the AVX_Fast_Unaligned_Load.
> > >
> > > Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
> > > use sse2_unaligned version of memset,strcpy and strcat.
> > > ---
> > >  sysdeps/x86/cpu-features.c | 66 ++++++++++++++++++++++---------------
> > > -
> > >  1 file changed, 39 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> > > index 3d7c2819d7..24fbf699b9 100644
> > > --- a/sysdeps/x86/cpu-features.c
> > > +++ b/sysdeps/x86/cpu-features.c
> > > @@ -1015,7 +1015,7 @@
> > > https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
> > >        kind = arch_kind_zhaoxin;
> > >
> > >        get_common_indices (cpu_features, &family, &model,
> > > &extended_model,
> > > -                       &stepping);
> > > +            &stepping);
> >
> > Don't randomly change tabs to spaces.
> 
> We have a clang-format file, you should run it over the entire patch.
> 
Ack. I will fix this.  

> Also, can you comment or enum the different models?

Will fix it.
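
For illustration, one possible shape for such an enum (a sketch only:
the model numbers are taken from the patch below, the mapping of
0x5b/0x6b to KH-40000/KX-7000 is inferred from the commit message, and
the names here are placeholders, not the final spelling):

  /* Sketch only: placeholder names for the Zhaoxin model numbers that
     appear in the patch below.  The 0x5b/0x6b -> KH-40000/KX-7000
     mapping is an assumption based on the commit message.  */
  enum zhaoxin_model
  {
    /* family 0x6 */
    ZHAOXIN_F6_MODEL_0F = 0x0f,
    ZHAOXIN_F6_MODEL_19 = 0x19,
    /* family 0x7 */
    ZHAOXIN_F7_MODEL_1B = 0x1b,
    ZHAOXIN_F7_MODEL_3B = 0x3b,
    ZHAOXIN_KH_40000    = 0x5b,
    ZHAOXIN_KX_7000     = 0x6b,
  };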

> >
> > --
> > Xi Ruoyao <xry111@xry111.site>
> > School of Aerospace Science and Technology, Xidian University
Florian Weimer June 26, 2024, 11:01 a.m. UTC | #4
* MayShao:

> From: MayShao <mayshao-oc@zhaoxin.com>
>
> Fix code indentation issues under the Zhaoxin branch.
>
> Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
> the AVX_Fast_Unaligned_Load.
>
> Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
> use sse2_unaligned version of memset,strcpy and strcat.

Somewhat related to that, do you have documentation of the behavior of
*aligned* 128-bit loads?  Are they guaranteed to be atomic?
At least if MOVAPD, MOVAPS, MOVDQA are used?

Thanks,
Florian
Mayshao-oc June 27, 2024, 6:04 a.m. UTC | #5
On Wed, Jun 26, 2024 at 7:01 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * MayShao:
> 
> > From: MayShao <mayshao-oc@zhaoxin.com>
> >
> > Fix code indentation issues under the Zhaoxin branch.
> >
> > Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
> > the AVX_Fast_Unaligned_Load.
> >
> > Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
> > use sse2_unaligned version of memset,strcpy and strcat.
> 
> Somewhat related to that, do you have documentation of the behavior of
> *aligned* 128-bit loads?  Are they guaranteed to be atomic?
> At least if MOVAPD, MOVAPS, MOVDQA are used?

I can confirm that aligned 128-bit loads (such as MOVAPD, MOVAPS, and
MOVDQA) from WB memory regions are atomic, and unaligned 128-bit loads
are also guaranteed to be atomic as long as they stay within one cache line.
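
For illustration, a minimal sketch of the kind of aligned 128-bit load
being discussed, using the SSE2 intrinsic that typically lowers to
MOVDQA (function name is illustrative; the atomicity above is a
hardware property of WB memory on these parts, not an atomic operation
in the C memory model):

  #include <emmintrin.h>   /* SSE2: __m128i, _mm_load_si128 */

  /* Sketch only: an aligned 16-byte load that compilers typically emit
     as MOVDQA (or VMOVDQA when AVX is enabled).  _mm_load_si128
     requires P to be 16-byte aligned.  */
  static inline __m128i
  load_128_aligned (const void *p)
  {
    return _mm_load_si128 ((const __m128i *) p);
  }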

>
> Thanks,
> Florian
>
Florian Weimer June 27, 2024, 6:32 a.m. UTC | #6
> On Wen, Jun 16,2024 7:01 PM  Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * MayShao:
>> 
>> > From: MayShao <mayshao-oc@zhaoxin.com>
>> >
>> > Fix code indentation issues under the Zhaoxin branch.
>> >
>> > Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
>> > the AVX_Fast_Unaligned_Load.
>> >
>> > Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
>> > use sse2_unaligned version of memset,strcpy and strcat.
>> 
>> Somewhat related to that, do you have documentation of the behavior of
>> *aligned* 128-bit loads?  Are they guaranteed to be atomic?
>> At least if MOVAPD, MOVAPS, MOVDQA are used?
>
> I can confirm is that aligned 128-bit loads (such as MOVAPD, MOVAPS, 
> MOVDQA) in the WB memory region are atomic, and for unaligned 
> 128-bit loads, it can also be guaranteed to be atomic if within a cacheline.

This is great news.  Could you update this GCC bug with the information?

  Bug 104688 - gcc and libatomic can use SSE for 128-bit atomic loads
  on Intel and AMD CPUs with AVX
  <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688>

I think it means we can teach GCC to use 128-bit atomic loads
unconditionally for AVX targets (bypassing libatomic).
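
For illustration, a sketch of the 16-byte load in question (with
current GCC this typically goes through __atomic_load_16 in libatomic,
linked via -latomic; the idea would be to emit an aligned vector load
inline instead on such targets — the function name and memory order
here are illustrative only):

  /* Sketch only: a 128-bit atomic load expressed with the GCC
     __atomic built-ins.  Today this usually becomes a libatomic call;
     on AVX targets it could be lowered to an aligned 16-byte vector
     load per the discussion above.  */
  static unsigned __int128
  atomic_load_u128 (const unsigned __int128 *p)
  {
    return __atomic_load_n (p, __ATOMIC_RELAXED);
  }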

Thanks,
Florian
Xi Ruoyao June 27, 2024, 6:37 a.m. UTC | #7
On Thu, 2024-06-27 at 08:32 +0200, Florian Weimer wrote:
> > On Wen, Jun 16,2024 7:01 PM  Florian Weimer <fweimer@redhat.com>
> > wrote:
> > > 
> > > * MayShao:
> > > 
> > > > From: MayShao <mayshao-oc@zhaoxin.com>
> > > > 
> > > > Fix code indentation issues under the Zhaoxin branch.
> > > > 
> > > > Unaligned AVX load are slower on KH-40000 and KX-7000, so
> > > > disable
> > > > the AVX_Fast_Unaligned_Load.
> > > > 
> > > > Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
> > > > use sse2_unaligned version of memset,strcpy and strcat.
> > > 
> > > Somewhat related to that, do you have documentation of the
> > > behavior of
> > > *aligned* 128-bit loads?  Are they guaranteed to be atomic?
> > > At least if MOVAPD, MOVAPS, MOVDQA are used?
> > 
> > I can confirm is that aligned 128-bit loads (such as MOVAPD, MOVAPS,
> > MOVDQA) in the WB memory region are atomic, and for unaligned 
> > 128-bit loads, it can also be guaranteed to be atomic if within a
> > cacheline.
> 
> This is great news.  Could you update this GCC bug with the
> information?
> 
>   Bug 104688 - gcc and libatomic can use SSE for 128-bit atomic loads
>   on Intel and AMD CPUs with AVX
>   <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688>

The info is already there.  But the discussion stalled after some
concern about non-Write-Back memory regions was raised.

Patch

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 3d7c2819d7..24fbf699b9 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -1015,7 +1015,7 @@  https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
       kind = arch_kind_zhaoxin;
 
       get_common_indices (cpu_features, &family, &model, &extended_model,
-			  &stepping);
+            &stepping);
 
       get_extended_indices (cpu_features);
 
@@ -1026,38 +1026,50 @@  https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
         {
           if (model == 0xf || model == 0x19)
             {
-	      CPU_FEATURE_UNSET (cpu_features, AVX);
-	      CPU_FEATURE_UNSET (cpu_features, AVX2);
+                CPU_FEATURE_UNSET (cpu_features, AVX);
+                CPU_FEATURE_UNSET (cpu_features, AVX2);
 
-              cpu_features->preferred[index_arch_Slow_SSE4_2]
-                |= bit_arch_Slow_SSE4_2;
+                cpu_features->preferred[index_arch_Slow_SSE4_2]
+                  |= bit_arch_Slow_SSE4_2;
 
-	      cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
-		&= ~bit_arch_AVX_Fast_Unaligned_Load;
+                cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
+                  &= ~bit_arch_AVX_Fast_Unaligned_Load;
             }
         }
       else if (family == 0x7)
         {
-	  if (model == 0x1b)
-	    {
-	      CPU_FEATURE_UNSET (cpu_features, AVX);
-	      CPU_FEATURE_UNSET (cpu_features, AVX2);
-
-	      cpu_features->preferred[index_arch_Slow_SSE4_2]
-		|= bit_arch_Slow_SSE4_2;
-
-	      cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
-		&= ~bit_arch_AVX_Fast_Unaligned_Load;
-	    }
-	  else if (model == 0x3b)
-	    {
-	      CPU_FEATURE_UNSET (cpu_features, AVX);
-	      CPU_FEATURE_UNSET (cpu_features, AVX2);
-
-	      cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
-		&= ~bit_arch_AVX_Fast_Unaligned_Load;
-	    }
-	}
+          switch (model)
+            {
+            case 0x1b:
+                CPU_FEATURE_UNSET (cpu_features, AVX);
+                CPU_FEATURE_UNSET (cpu_features, AVX2);
+
+                cpu_features->preferred[index_arch_Slow_SSE4_2]
+                  |= bit_arch_Slow_SSE4_2;
+
+                cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
+                  &= ~bit_arch_AVX_Fast_Unaligned_Load;
+                break;
+
+            case 0x3b:
+                CPU_FEATURE_UNSET (cpu_features, AVX);
+                CPU_FEATURE_UNSET (cpu_features, AVX2);
+
+                cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
+                  &= ~bit_arch_AVX_Fast_Unaligned_Load;
+                break;
+
+            case 0x5b:
+            case 0x6b:
+                cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
+                  &= ~bit_arch_AVX_Fast_Unaligned_Load;
+
+                cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
+                  |= (bit_arch_Prefer_No_VZEROUPPER
+                  | bit_arch_Fast_Unaligned_Load);
+                break;
+            }
+        }
     }
   else
     {