Message ID: D4C76825A6780047854A11E93CDE84D004D3054E8E@SAUSEXMBP01.amd.com
State: New
On Thu, 10 Feb 2011, Fang, Changpeng wrote:

> Hi,
>
> Attached is the patch to force gcc to generate 128-bit avx instructions
> for bdver1. We found that for the current Bulldozer processors, AVX128
> performs better than AVX256. For example, AVX128 is 3% faster than
> AVX256 on CFP2006, and 2~3% faster than AVX256 on polyhedron.
>
> As a result, we prefer gcc 4.6 to generate 128-bit avx instructions
> only (for bdver1).
>
> The patch passed bootstrapping on x86_64-unknown-linux-gnu with
> "-O3 -g -march=bdver1" and the necessary correctness and performance
> testing.
>
> Is it OK to commit to trunk?

I think there was no attempt to tune anything for AVX256; in particular,
the vectorizer cost model may be completely off. HJ and Andi also hinted
at some alignment problems (at least SB seems to have a large penalty
when loads cross a cacheline boundary). So - did you do any
investigation of why 256-bit vectors are slower for you? Are these cases
that the cost model could easily catch?

Thanks,
Richard.
On Fri, Feb 11, 2011 at 1:46 AM, Richard Guenther <rguenther@suse.de> wrote:
>> Attached is the patch to force gcc to generate 128-bit avx instructions
>> for bdver1. We found that for the current Bulldozer processors, AVX128
>> performs better than AVX256. For example, AVX128 is 3% faster than
>> AVX256 on CFP2006, and 2~3% faster than AVX256 on polyhedron.
>>
>> As a result, we prefer gcc 4.6 to generate 128-bit avx instructions
>> only (for bdver1).
>>
>> The patch passed bootstrapping on x86_64-unknown-linux-gnu with
>> "-O3 -g -march=bdver1" and the necessary correctness and performance
>> testing.
>>
>> Is it OK to commit to trunk?
>
> I think there was no attempt to tune anything for AVX256; in particular,
> the vectorizer cost model may be completely off. HJ and Andi also hinted
> at some alignment problems (at least SB seems to have a large penalty
> when loads cross a cacheline boundary). So - did you do any
> investigation of why 256-bit vectors are slower for you? Are these cases
> that the cost model could easily catch?

IIRC from reading about bdver1, AVX256 instructions are executed by
splitting them into two AVX128 instructions, which obviously will be
slower in some cases.

--
Pinski
On Fri, Feb 11, 2011 at 1:46 AM, Richard Guenther <rguenther@suse.de> wrote:
>> Attached is the patch to force gcc to generate 128-bit avx instructions
>> for bdver1. We found that for the current Bulldozer processors, AVX128
>> performs better than AVX256. For example, AVX128 is 3% faster than
>> AVX256 on CFP2006, and 2~3% faster than AVX256 on polyhedron.
>>
>> As a result, we prefer gcc 4.6 to generate 128-bit avx instructions
>> only (for bdver1).
>>
>> The patch passed bootstrapping on x86_64-unknown-linux-gnu with
>> "-O3 -g -march=bdver1" and the necessary correctness and performance
>> testing.
>>
>> Is it OK to commit to trunk?
>
> I think there was no attempt to tune anything for AVX256; in particular,
> the vectorizer cost model may be completely off. HJ and Andi also hinted
> at some alignment problems (at least SB seems to have a large penalty
> when loads cross a cacheline boundary). So - did you do any
> investigation of why 256-bit vectors are slower for you? Are these cases
> that the cost model could easily catch?

> IIRC from reading about bdver1, AVX256 instructions are executed by
> splitting them into two AVX128 instructions, which obviously will be
> slower in some cases.

Yes, this should be the major reason that avx256 is slower. And HJ's
patch that splits unaligned 256-bit loads/stores does not help. We plan
for gcc 4.6 to generate 128-bit avx for bdver1.

It is true that we should tune the vectorizer for avx256 and avx128, but
I am afraid that should be done in the 4.7 time frame.

Thanks,

Changpeng
From b2587889e4c8016f8bc4dde53fa0d59c1a9074da Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Thu, 10 Feb 2011 16:11:55 -0800
Subject: [PATCH] Generate 128-bit AVX instructions by default for bdver1

	* config/i386/i386.h (enum ix86_tune_indices): Introduce
	X86_PREFER_AVX128 feature entry.
	(ix86_tune_features): Define TARGET_PREFER_AVX128.

	* config/i386/i386.c (initial_ix86_tune_features): Set
	X86_PREFER_AVX128 for bdver1.
	(ix86_preferred_simd_mode): Set the appropriate modes when
	X86_PREFER_AVX128 is set (for bdver1).
---
 gcc/config/i386/i386.c | 7 +++++--
 gcc/config/i386/i386.h | 3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 12c7062..5c8346e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2082,6 +2082,9 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_TUNE_VECTORIZE_DOUBLE: Enable double precision vector
      instructions.  */
   ~m_ATOM,
+
+  /* X86_PREFER_AVX128: Generate AVX 128 instead of AVX 256.  */
+  m_BDVER1,
 };

 /* Feature tests against the various architecture variations.  */
@@ -34698,9 +34701,9 @@ ix86_preferred_simd_mode (enum machine_mode mode)
   switch (mode)
     {
     case SFmode:
-      return TARGET_AVX ? V8SFmode : V4SFmode;
+      return TARGET_AVX ? (TARGET_PREFER_AVX128 ? V4SFmode : V8SFmode) : V4SFmode;
     case DFmode:
-      return TARGET_AVX ? V4DFmode : V2DFmode;
+      return TARGET_AVX ? (TARGET_PREFER_AVX128 ? V2DFmode : V4DFmode) : V2DFmode;
     case DImode:
       return V2DImode;
     case SImode:
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index f14a95d..b84e6ed 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -322,6 +322,7 @@ enum ix86_tune_indices {
   X86_TUNE_FUSE_CMP_AND_BRANCH,
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
+  X86_PREFER_AVX128,
   X86_TUNE_LAST
 };

@@ -418,6 +419,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
+#define TARGET_PREFER_AVX128 \
+	ix86_tune_features[X86_PREFER_AVX128]

 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
--
1.6.3.3