
[PR,49089] Don't split AVX256 unaligned loads by default on bdver1 and generic

Message ID D4C76825A6780047854A11E93CDE84D005980DC6E8@SAUSEXMBP01.amd.com

Commit Message

Fang, Changpeng June 13, 2011, 11:59 p.m. UTC
Hi,

The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) introduced splitting of AVX256 unaligned loads.
However, we found that it causes significant regressions on CPU2006 ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).

In this work, we introduce a tune option so that splitting unaligned loads is enabled by default only on CPUs where
such splitting is beneficial.

The patch passed bootstrapping and regression tests on x86_64-unknown-linux-gnu system.

Is it OK to commit?

Thanks,

Changpeng

Comments

Richard Biener June 14, 2011, 10:13 a.m. UTC | #1
On Tue, Jun 14, 2011 at 1:59 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) which introduces splitting avx256 unaligned loads.
> However, we found that it causes significant regressions for cpu2006 ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).
>
> In this work, we introduce a tune option that sets splitting unaligned loads default only for such CPUs that such splitting
> is beneficial.
>
> The patch passed bootstrapping and regression tests on x86_64-unknown-linux-gnu system.
>
> Is it OK to commit?

It probably should go to the 4.6 branch as well.  Note that I find the
X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd,
why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?

I'll defer to x86 maintainers for approval.

Richard.

> Thanks,
>
> Changpeng
Jakub Jelinek June 14, 2011, 10:16 a.m. UTC | #2
On Tue, Jun 14, 2011 at 12:13:47PM +0200, Richard Guenther wrote:
> On Tue, Jun 14, 2011 at 1:59 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> > The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) which introduces splitting avx256 unaligned loads.
> > However, we found that it causes significant regressions for cpu2006 ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).
> >
> > In this work, we introduce a tune option that sets splitting unaligned loads default only for such CPUs that such splitting
> > is beneficial.
> >
> > The patch passed bootstrapping and regression tests on x86_64-unknown-linux-gnu system.
> >
> > Is it OK to commit?
> 
> It probably should go to the 4.6 branch as well.  Note that I find the
> X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd,
> why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?

I also wonder what we should do for -mtune=generic.  Should we split or not?
How big an improvement is it on Intel chips, and how big a degradation does it
cause on AMD chips (I assume no other chip maker currently supports AVX)?

	Jakub
H.J. Lu June 14, 2011, 1:05 p.m. UTC | #3
On Tue, Jun 14, 2011 at 3:16 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 14, 2011 at 12:13:47PM +0200, Richard Guenther wrote:
>> On Tue, Jun 14, 2011 at 1:59 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> > The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) which introduces splitting avx256 unaligned loads.
>> > However, we found that it causes significant regressions for cpu2006 ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).
>> >
>> > In this work, we introduce a tune option that sets splitting unaligned loads default only for such CPUs that such splitting
>> > is beneficial.
>> >
>> > The patch passed bootstrapping and regression tests on x86_64-unknown-linux-gnu system.
>> >
>> > Is it OK to commit?
>>
>> It probably should go to the 4.6 branch as well.  Note that I find the
>> X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd,
>> why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?
>
> I also wonder what we should do for -mtune=generic.  Should we split or not?
> How big improvement is it on Intel chips, how big degradation does it
> cause on AMD chips (I assume no other chip maker currently supports AVX)?
>

Simply turning off the 32-byte unaligned load split, which would introduce
performance regressions on Intel Sandy Bridge processors, isn't an
appropriate solution.

I am proposing a different approach so that we can improve -mtune=generic
performance on current Intel and AMD processors.

The current default GCC tuning, -mtune=generic, was implemented in 2005 for
Intel Pentium 4, Core 2 and AMD K8 processors.  Many of those optimization
choices are no longer applicable to current Intel or AMD processors.

We should choose a set of optimization choices for -mtune=generic, including
the 32-byte unaligned load split, for current Intel and AMD processors, which
should improve performance without introducing regressions.
Fang, Changpeng June 14, 2011, 10:41 p.m. UTC | #4
>It probably should go to the 4.6 branch as well.

H.J. Lu's original patch, which splits both unaligned loads and stores, was checked into the GCC 4.7
trunk. We found that splitting unaligned stores is beneficial on bdver1, while splitting unaligned
loads degrades CFP2006 by 1.3% in geomean on Bulldozer. As a result, we suggest backporting
unaligned store splitting only (H.J.'s original patch + this one) to the 4.6 branch.

 >Note that I find the
>X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd,
>why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?

AVX256_SPLIT_UNALIGNED_LOAD has already been used for the flag
-mavx256-split-unaligned-load, and we intend to keep that flag for performance tuning.
As a result, we append _OPTIMAL (or _BENEFICIAL) for the default setting.


>I'll defer to x86 maintainers for approval.

So, is it OK to commit this patch to trunk, and H.J's original patch + this to 4.6 branch?


Thanks,

Changpeng


>Richard.

> Thanks,
>
> Changpeng
H.J. Lu June 14, 2011, 10:46 p.m. UTC | #5
On Tue, Jun 14, 2011 at 3:41 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>It probably should go to the 4.6 branch as well.
>
> H.J. Lu's original patch that splits unaligned load and store was checked in gcc 4.7
> trunk. We found that,  splitting unaligned store is beneficial to bdver1, splitting unaligned
> load degrades cfp2006 by 1.3% in geomean on Bulldozer. As a result, we suggest putting
> unaligned store splitting (H.J. original patch + this one) back to 4.6 branch.
>
>  >Note that I find the
>>X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd,
>>why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?
>
> AVX256_SPLIT_UNALIGNED_LOAD has already  been used for the flag:
> -mavx256-split-unaligned-load, and we intend to keep that flag for performance tuning.
> As a result, we put _OPTIMAL (or _BENEFICAL) at the end for default setting.
>
>
>>I'll defer to x86 maintainers for approval.
>
> So, is it OK to commit this patch to trunk, and H.J's original patch + this to 4.6 branch?
>
>

I have no problem with -mtune=Bulldozer.  But I object to the -mtune=generic
change and have suggested a different approach for -mtune=generic.
Fang, Changpeng June 14, 2011, 11:01 p.m. UTC | #6
A similar argument applies to software prefetching, for which we observed a ~2% benefit on Greyhound (not that much
on Bulldozer). We would also prefer turning on software prefetching at -O3 for -mtune=generic.

--Changpeng
H.J. Lu June 14, 2011, 11:58 p.m. UTC | #7
On Tue, Jun 14, 2011 at 4:01 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> A similar argument is for software prefetching, which we observed a ~2% benefit on greyhound (not that much
> for Bulldozer). We would also prefer turning on software prefetching at -O3 for -mtune=generic.

Sure, we can put everything on the table and take a look.

> Simply turning off 32byte aligned load split, which introduces
> performance regressions on
> Intel Sandy Bridge processors, isn't an appropriate solution.
>
> I am proposing a different approach so that we can improve
> -mtune=generic performance
> on current Intel and AMD processors.
>
> The current default GCC tuning, -mtune=generic, was implemented in
> 2005 for Intel
> Pentium 4, Core 2 and AMD K8 processors.  Many optimization choices
> are no longer
> applicable to the current Intel nor AMD processors.
>
> We should choose a set of optimization choices for -mtune=generic,
> including 32byte
> unaligned load split, for the current Intel and AMD processors,  which
> should improve
> performance with no performance regressions.
>
>
Fang, Changpeng June 14, 2011, 11:59 p.m. UTC | #8
>
> So, is it OK to commit this patch to trunk, and H.J's original patch + this to 4.6 branch?

>I have no problems on -mtune=Bulldozer.  But I object -mtune=generic
>change and did suggest a different approach for -mtune=generic.

What's your suggested approach for -mtune=generic?

My understanding of putting something into -mtune=generic is that either everyone is happy, or
some are willing to lose. This rule guided us not to turn on software prefetching for
-mtune=generic.

I would like to hear your suggestion for how Sandy Bridge can get the performance gain while
Bulldozer does not lose from splitting avx256 unaligned loads!

Thanks,

Changpeng
H.J. Lu June 15, 2011, 12:35 a.m. UTC | #9
On Tue, Jun 14, 2011 at 4:59 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>
>
>>
>> So, is it OK to commit this patch to trunk, and H.J's original patch + this to 4.6 branch?
>
>>I have no problems on -mtune=Bulldozer.  But I object -mtune=generic
>>change and did suggest a different approach for -mtune=generic.
>
> What's your suggested approach for -mtune=generic?
>
> My understanding of putting something into -mtune=generic is that all are happy, or
> some are willing to lose. This rule guided us not to turn on software prefetching for
> -mtune=generic.
>
> I would like to hear your suggestion that sandy bridge could get the performance while
> bulldozer does not lose from splitting avx256 unaligned loads!
>
> Thanks,

See:

http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01045.html

for how we did it last time.
Fang, Changpeng June 15, 2011, 9:06 p.m. UTC | #10
>I have no problems on -mtune=Bulldozer.  But I object -mtune=generic
>change and did suggest a different approach for -mtune=generic.

Something must be broken in the unaligned load splitting in generic mode.

While we lose 1.3% on CFP2006 in geomean by splitting unaligned loads for -mtune=bdver1, splitting
unaligned loads in generic mode is KILLING us.

For 459.GemsFDTD (ref run) on Bulldozer:
 -Ofast -mavx -mno-avx256-split-unaligned-load:   480s
 -Ofast -mavx                                 :  2527s

So splitting unaligned loads makes the program run 5-6 times slower!

For the 434.zeusmp train run:
 -Ofast -mavx -mno-avx256-split-unaligned-load:  32.5s
 -Ofast -mavx                                 :   106s

Other tests are ongoing!


Changpeng.
Richard Biener June 16, 2011, 7:39 a.m. UTC | #11
On Wed, Jun 15, 2011 at 11:06 PM, Fang, Changpeng
<Changpeng.Fang@amd.com> wrote:
>>I have no problems on -mtune=Bulldozer.  But I object -mtune=generic
>>change and did suggest a different approach for -mtune=generic.
>
> Something must have been broken for the unaligned load splitting in generic mode.
>
> While we lose 1.3% on CFP2006 in geomean by splitting unaligned loads for -mtune=bdver1, splitting
> unaligned loads in generic mode is KILLING us:
>
> For 459.GemsFDTD (ref) on Bulldozer,
>  -Ofast -mavx -mno-avx256-split-unaligned-load:   480s
> -Ofast -mavx                                                       :    2527s
>
> So, splitting unaligned loads results in the program to run 5~6 times slower!
>
> For 434.zeusmp train run
>  -Ofast -mavx -mno-avx256-split-unaligned-load:   32.5s
> -Ofast -mavx                                                       :    106s
>
> Other tests are on-going!

I suspect that the split loads get further split into mov[lh]ps pieces?
We do that for SSE moves with generic tuning at least IIRC.

Richard.

>
> Changpeng.
>
>
>

Patch

From 415012803abf2cac95c067394504c55dd968f4f5 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH] pr49089: enable avx256 splitting unaligned load only when beneficial

	* config/i386/i386.h (ix86_tune_indices): Introduce
	  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	  (TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL): New definition.

	* config/i386/i386.c (ix86_tune_features): Add entry for
	  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	  (ix86_option_override_internal): Enable avx256 unaligned load splitting
	  only when TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL is set.
---
 gcc/config/i386/i386.c |    9 +++++++--
 gcc/config/i386/i386.h |    3 +++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..d5f358f 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,11 @@  static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching
      at -O3.  For the moment, the prefetching seems badly tuned for Intel
      chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL: Enable splitting 256-bit
+     unaligned load.  It hurts the performance on Bulldozer.  */
+  m_COREI7
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -4194,7 +4198,8 @@  ix86_option_override_internal (bool main_args_p)
 	  if (flag_expensive_optimizations
 	      && !(target_flags_explicit & MASK_VZEROUPPER))
 	    target_flags |= MASK_VZEROUPPER;
-	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+	  if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL
+	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
 	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8badcbb..b2a1bc8 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -312,6 +312,7 @@  enum ix86_tune_indices {
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
+  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL,
 
   X86_TUNE_LAST
 };
@@ -410,6 +411,8 @@  extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
+#define TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL \
+	ix86_tune_features[X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL]
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
-- 
1.7.0.4