Message ID: D4C76825A6780047854A11E93CDE84D005980DC6E8@SAUSEXMBP01.amd.com
State: New
On Tue, Jun 14, 2011 at 1:59 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) introduces
> splitting of avx256 unaligned loads.  However, we found that it causes
> significant regressions for cpu2006
> ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).
>
> In this work, we introduce a tune option that enables splitting of unaligned
> loads by default only for those CPUs on which the splitting is beneficial.
>
> The patch passed bootstrapping and regression tests on an
> x86_64-unknown-linux-gnu system.
>
> Is it OK to commit?

It probably should go to the 4.6 branch as well.  Note that I find the name
X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd; why not call it simply
X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?

I'll defer to the x86 maintainers for approval.

Richard.

> Thanks,
>
> Changpeng
On Tue, Jun 14, 2011 at 12:13:47PM +0200, Richard Guenther wrote:
> On Tue, Jun 14, 2011 at 1:59 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> > The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) introduces
> > splitting of avx256 unaligned loads.  However, we found that it causes
> > significant regressions for cpu2006
> > ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).
> >
> > In this work, we introduce a tune option that enables splitting of unaligned
> > loads by default only for those CPUs on which the splitting is beneficial.
> >
> > The patch passed bootstrapping and regression tests on an
> > x86_64-unknown-linux-gnu system.
> >
> > Is it OK to commit?
>
> It probably should go to the 4.6 branch as well.  Note that I find the name
> X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd;
> why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?

I also wonder what we should do for -mtune=generic.  Should we split or not?
How big an improvement is it on Intel chips, and how big a degradation does it
cause on AMD chips (I assume no other chip maker currently supports AVX)?

	Jakub
On Tue, Jun 14, 2011 at 3:16 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 14, 2011 at 12:13:47PM +0200, Richard Guenther wrote:
>> On Tue, Jun 14, 2011 at 1:59 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> > The patch ( http://gcc.gnu.org/ml/gcc-patches/2011-02/txt00059.txt ) introduces
>> > splitting of avx256 unaligned loads.  However, we found that it causes
>> > significant regressions for cpu2006
>> > ( http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49089 ).
>> >
>> > In this work, we introduce a tune option that enables splitting of unaligned
>> > loads by default only for those CPUs on which the splitting is beneficial.
>> >
>> > The patch passed bootstrapping and regression tests on an
>> > x86_64-unknown-linux-gnu system.
>> >
>> > Is it OK to commit?
>>
>> It probably should go to the 4.6 branch as well.  Note that I find the name
>> X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd;
>> why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?
>
> I also wonder what we should do for -mtune=generic.  Should we split or not?
> How big an improvement is it on Intel chips, and how big a degradation does it
> cause on AMD chips (I assume no other chip maker currently supports AVX)?

Simply turning off the 32-byte unaligned load split, which would introduce
performance regressions on Intel Sandy Bridge processors, isn't an appropriate
solution.

I am proposing a different approach so that we can improve -mtune=generic
performance on current Intel and AMD processors.

The current default GCC tuning, -mtune=generic, was implemented in 2005 for
Intel Pentium 4, Core 2 and AMD K8 processors.  Many of those optimization
choices no longer apply to current Intel or AMD processors.

We should choose a set of optimization choices for -mtune=generic, including
the 32-byte unaligned load split, targeted at current Intel and AMD processors,
which should improve performance without introducing regressions.
> It probably should go to the 4.6 branch as well.

H.J. Lu's original patch that splits unaligned loads and stores was checked
into the gcc 4.7 trunk.  We found that splitting unaligned stores is beneficial
on bdver1, while splitting unaligned loads degrades cfp2006 by 1.3% in geomean
on Bulldozer.  As a result, we suggest putting only the unaligned store
splitting (H.J.'s original patch + this one) back on the 4.6 branch.

> Note that I find the name X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd;
> why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?

AVX256_SPLIT_UNALIGNED_LOAD is already used for the flag
-mavx256-split-unaligned-load, and we intend to keep that flag for performance
tuning.  As a result, we put _OPTIMAL (or _BENEFICIAL) at the end for the
default setting.

> I'll defer to the x86 maintainers for approval.

So, is it OK to commit this patch to trunk, and H.J.'s original patch + this
one to the 4.6 branch?

Thanks,

Changpeng
On Tue, Jun 14, 2011 at 3:41 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> It probably should go to the 4.6 branch as well.
>
> H.J. Lu's original patch that splits unaligned loads and stores was checked
> into the gcc 4.7 trunk.  We found that splitting unaligned stores is
> beneficial on bdver1, while splitting unaligned loads degrades cfp2006 by
> 1.3% in geomean on Bulldozer.  As a result, we suggest putting only the
> unaligned store splitting (H.J.'s original patch + this one) back on the
> 4.6 branch.
>
>> Note that I find the name X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL odd;
>> why not call it simply X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD?
>
> AVX256_SPLIT_UNALIGNED_LOAD is already used for the flag
> -mavx256-split-unaligned-load, and we intend to keep that flag for
> performance tuning.  As a result, we put _OPTIMAL (or _BENEFICIAL) at the
> end for the default setting.
>
>> I'll defer to the x86 maintainers for approval.
>
> So, is it OK to commit this patch to trunk, and H.J.'s original patch + this
> one to the 4.6 branch?

I have no problem with the Bulldozer tuning change.  But I object to the
-mtune=generic change, and I did suggest a different approach for
-mtune=generic.
A similar argument applies to software prefetching, where we observed a ~2%
benefit on greyhound (not that much on Bulldozer).  We would also prefer
turning on software prefetching at -O3 for -mtune=generic.

Changpeng
On Tue, Jun 14, 2011 at 4:01 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> A similar argument applies to software prefetching, where we observed a ~2%
> benefit on greyhound (not that much on Bulldozer).  We would also prefer
> turning on software prefetching at -O3 for -mtune=generic.

Sure, we can put everything on the table and take a look.

> Simply turning off the 32-byte unaligned load split, which would introduce
> performance regressions on Intel Sandy Bridge processors, isn't an
> appropriate solution.
>
> I am proposing a different approach so that we can improve -mtune=generic
> performance on current Intel and AMD processors.
>
> The current default GCC tuning, -mtune=generic, was implemented in 2005 for
> Intel Pentium 4, Core 2 and AMD K8 processors.  Many of those optimization
> choices no longer apply to current Intel or AMD processors.
>
> We should choose a set of optimization choices for -mtune=generic, including
> the 32-byte unaligned load split, targeted at current Intel and AMD
> processors, which should improve performance without introducing regressions.
>> So, is it OK to commit this patch to trunk, and H.J.'s original patch +
>> this one to the 4.6 branch?
>
> I have no problem with the Bulldozer tuning change.  But I object to the
> -mtune=generic change, and I did suggest a different approach for
> -mtune=generic.

What is your suggested approach for -mtune=generic?

My understanding of putting something into -mtune=generic is that either
everyone is happy, or some are willing to lose.  This rule guided us not to
turn on software prefetching for -mtune=generic.

I would like to hear your suggestion for how Sandy Bridge could get the
performance gain while Bulldozer does not lose from splitting avx256 unaligned
loads!

Thanks,

Changpeng
On Tue, Jun 14, 2011 at 4:59 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> What is your suggested approach for -mtune=generic?
>
> My understanding of putting something into -mtune=generic is that either
> everyone is happy, or some are willing to lose.  This rule guided us not to
> turn on software prefetching for -mtune=generic.
>
> I would like to hear your suggestion for how Sandy Bridge could get the
> performance gain while Bulldozer does not lose from splitting avx256
> unaligned loads!
>
> Thanks,

See:

http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01045.html

for how we did it last time.
> I have no problem with the Bulldozer tuning change.  But I object to the
> -mtune=generic change, and I did suggest a different approach for
> -mtune=generic.

Something must be broken in the unaligned load splitting in generic mode.

While we lose 1.3% on CFP2006 in geomean by splitting unaligned loads with
-mtune=bdver1, splitting unaligned loads in generic mode is KILLING us:

For 459.GemsFDTD (ref) on Bulldozer:
  -Ofast -mavx -mno-avx256-split-unaligned-load:  480s
  -Ofast -mavx                                 : 2527s

So splitting unaligned loads causes the program to run 5~6 times slower!

For the 434.zeusmp train run:
  -Ofast -mavx -mno-avx256-split-unaligned-load: 32.5s
  -Ofast -mavx                                 : 106s

Other tests are ongoing!

Changpeng
On Wed, Jun 15, 2011 at 11:06 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Something must be broken in the unaligned load splitting in generic mode.
>
> While we lose 1.3% on CFP2006 in geomean by splitting unaligned loads with
> -mtune=bdver1, splitting unaligned loads in generic mode is KILLING us:
>
> For 459.GemsFDTD (ref) on Bulldozer:
>   -Ofast -mavx -mno-avx256-split-unaligned-load:  480s
>   -Ofast -mavx                                 : 2527s
>
> So splitting unaligned loads causes the program to run 5~6 times slower!
>
> For the 434.zeusmp train run:
>   -Ofast -mavx -mno-avx256-split-unaligned-load: 32.5s
>   -Ofast -mavx                                 : 106s
>
> Other tests are ongoing!

I suspect that the split loads get further split into mov[lh]ps pieces?  We do
that for SSE moves with generic tuning, at least IIRC.

Richard.
From 415012803abf2cac95c067394504c55dd968f4f5 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH] pr49089: enable avx256 splitting unaligned load only when
 beneficial

	* config/i386/i386.h (ix86_tune_indices): Introduce
	X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	(TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL): New definition.

	* config/i386/i386.c (ix86_tune_features): Add entry for
	X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	(ix86_option_override_internal): Enable avx256 unaligned load
	splitting only when TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL
	is set.
---
 gcc/config/i386/i386.c |    9 +++++++--
 gcc/config/i386/i386.h |    3 +++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..d5f358f 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,11 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching
      at -O3.  For the moment, the prefetching seems badly tuned for Intel
      chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL: Enable splitting 256-bit
+     unaligned load.  It hurts the performance on Bulldozer.  */
+  m_COREI7
 };

 /* Feature tests against the various architecture variations.  */
@@ -4194,7 +4198,8 @@ ix86_option_override_internal (bool main_args_p)
       if (flag_expensive_optimizations
	  && !(target_flags_explicit & MASK_VZEROUPPER))
	target_flags |= MASK_VZEROUPPER;
-      if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL
+	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
       if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8badcbb..b2a1bc8 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -312,6 +312,7 @@ enum ix86_tune_indices {
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
+  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL,
   X86_TUNE_LAST
 };
@@ -410,6 +411,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
+#define TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL \
+	ix86_tune_features[X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL]

 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
-- 
1.7.0.4