Message ID | D4C76825A6780047854A11E93CDE84D02F7743@SAUSEXMBP01.amd.com |
---|---|
State | New |
Headers | show |
On Tue, Jun 15, 2010 at 2:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote: > > Hi, > > This patch serves as a proposal to turn on software prefetching by default at -O3. > > Software prefetching pass has been introduced into gcc for a long time. It has been > observed that the prefetch pass is quite stable now and can noticeably improve > program performance where locality is a concern. As a result, we think it is time > to enable it under -O3 so that we can get the benefit out of it. > > We have collected the data that shows the impact of prefetch on cpu2006 performance > under -O3 with gcc 4.6. There is a ~2% improvement on both integer and floating programs > and there is no apparent degradation. Programs listed below are those that have at least > (+/-)1% variation: > 434.zeusmp (+1.77%), 436.cactusADM (+2.4%), 450.soplex (+1.28%), > 459.GemsFDTD (+5.48%), 470.lbm (+31.84%), 482.sphnix3 (+1.01%), > 458.sjeng (-1.27%), 462.libquantum (+19.23%). > > The patch passed bootstrapping with -O3 -g. We have observed 2 failues in regression tests: > (1) gcc.dg/torture/stackalign/setjmp-1.c -O3 -fomit-frame-pointer (internal compiler error) > This is bug 44503. It is a CFG problem associate with _builtin_prefetch and we have had > a fix for it. > (2) gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2 > With prefetching turned on, the memory count is different. We can easily fix the test case after > this patch is accepted. > > We realize that prefetch in gcc still has room for improvement. But we think it is good enough > now to be turned on. > > Is it acceptable to commit this patch? > Have you measured performance impact on Intel Core i7?
>Have you measured performance impact on Intel Core i7?
Noy yet. I should have mentioned that the performance data I posted in based
on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my
last change to fix tonto regression).
Thanks,
Changpeng
On Tue, Jun 15, 2010 at 3:10 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote: >>Have you measured performance impact on Intel Core i7? > > Noy yet. I should have mentioned that the performance data I posted in based > on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my > last change to fix tonto regression). > Is that rate or speed? I would like to see numbers on Intel Core i7 if you are proposing enable it by default for x86. Thanks.
>On Tue, Jun 15, 2010 at 3:10 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote: >>>Have you measured performance impact on Intel Core i7? >> >> Noy yet. I should have mentioned that the performance data I posted in based >> on the test on amd-linux64 system and gcc 4.6 r160766. (which includes my >> last change to fix tonto regression). >> >Is that rate or speed? I would like to see numbers on Intel Core i7 if you are >proposing enable it by default for x86. Speed run. Hopefully prefetching could benefit all targets that support sw prefetch. Thanks, Changpeng
Fang, Changpeng wrote: >> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are >> proposing enable it by default for x86. HJ, are you in position to test this on Core i7? Since you work at Intel, it seems like it might be easier for you to do it than for the folks at AMD to do it? Maxim, is this something that you could easily test on some of our embedded hardware, assuming that the machine descriptions define the relevant prefetch instructions? For example, I think that ARM, MIPS, and Power all have theses instructions in the ISA, and we could look at CoreMark numbers to at least convince ourselves that this does no harm. Thanks,
On Tue, Jun 15, 2010 at 3:47 PM, Mark Mitchell <mark@codesourcery.com> wrote: > Fang, Changpeng wrote: > >>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are >>> proposing enable it by default for x86. > > HJ, are you in position to test this on Core i7? Since you work at > Intel, it seems like it might be easier for you to do it than for the > folks at AMD to do it? We never tried software prefetch since Intel processors rarely need it. We will try it. It will take some times.
H.J. Lu wrote: > We never tried software prefetch since Intel processors rarely need > it. We will try it. It will take some times. OK. The other thing we can certainly do, once we get some input about how well it does on various architectures, is to enable it conditionally. There is no reason that -O3 has to enable the same things on all architectures. If prefetching is a win on AMD CPUs, then -O3 should enable it on AMD CPUs -- but if it hurts Intel CPUs, or MIPS CPUs, or whatever, then it shouldn't be enabled on those architectures. So, let's see how generalizable Changpeng's numbers are to other architectures, and then we can decide where the setting goes. I think there's no question from the numbers that -O3 should enable this optimization on AMD CPUs. Thanks,
Am Mittwoch 16 Juni 2010, 01:03:20 schrieb Mark Mitchell: > H.J. Lu wrote: > > > We never tried software prefetch since Intel processors rarely need > > it. We will try it. It will take some times. > > OK. > > The other thing we can certainly do, once we get some input about how > well it does on various architectures, is to enable it conditionally. > There is no reason that -O3 has to enable the same things on all > architectures. If prefetching is a win on AMD CPUs, then -O3 should > enable it on AMD CPUs -- but if it hurts Intel CPUs, or MIPS CPUs, or > whatever, then it shouldn't be enabled on those architectures. > > So, let's see how generalizable Changpeng's numbers are to other > architectures, and then we can decide where the setting goes. I think > there's no question from the numbers that -O3 should enable this > optimization on AMD CPUs. We have also seen an overall win (around +1% for int and float) with -fprefetch-loop-arrays on s390 and activated it on O3 (see config/s390/s390.c override_options) for 4.6. So we (s390) are fine with enabling loop prefetching in general for O3. Changpeng, can you confirm that -O3 -fno-prefetch-loop-arrays does indeed deactivate prefetching? Christian
Hi,
> This patch serves as a proposal to turn on software prefetching by default at -O3.
On S/390 I ran into a problem with the runtime of the prefetching pass when using
aggressive loop unrolling. There is an algorithm with quadratic runtime regarding the
memory references which causes a very long compile time. Christian already opened a
bugzilla for that (PR 44576). It probably would make sense to have a closer look at this
before enabling the pass by default. Perhaps we could simply limit the miss rate
computation to a certain amount of memory references?
Besides of that enabling the prefetching pass by default is ok with me. I think it would
help getting more people into improving it.
Bye,
-Andreas-
Hi,
> This patch serves as a proposal to turn on software prefetching by default at -O3.
We probably also should remove these warnings if -fprefetch-loop-arrays hasn't been
specified explicitly:
#ifndef HAVE_prefetch
if (flag_prefetch_loop_arrays)
{
warning (0, "-fprefetch-loop-arrays not supported for this target");
flag_prefetch_loop_arrays = 0;
}
#else
if (flag_prefetch_loop_arrays && !HAVE_prefetch)
{
warning (0, "-fprefetch-loop-arrays not supported for this target (try -march
switches)");
flag_prefetch_loop_arrays = 0;
}
#endif
Bye,
-Andreas-
On Tue, Jun 15, 2010 at 3:55 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Tue, Jun 15, 2010 at 3:47 PM, Mark Mitchell <mark@codesourcery.com> wrote: >> Fang, Changpeng wrote: >> >>>> Is that rate or speed? I would like to see numbers on Intel Core i7 if you are >>>> proposing enable it by default for x86. >> >> HJ, are you in position to test this on Core i7? Since you work at >> Intel, it seems like it might be easier for you to do it than for the >> folks at AMD to do it? > > We never tried software prefetch since Intel processors rarely need > it. We will try it. It will take some times. > Here are what we got on Intel Core i7 64bit with -O3 vs -O3 -fprefetch-loop-arrays: 400.perlbench -0.369004% 401.bzip2 -3% 403.gcc -1.5748% 429.mcf 0.784314% 445.gobmk -1.28755% 456.hmmer -1.67364% 458.sjeng 0% 462.libquantum 2.57827% 464.h264ref -0.806452% 471.omnetpp -1.51515% 473.astar -0.581395% 483.xalancbmk 0% SPECint(R)_base2006 -0.766284% 410.bwaves -1.27796% 416.gamess -1.2605% 433.milc 0% 434.zeusmp 1.24481% 435.gromacs -0.478469% 436.cactusADM -5.07813% 437.leslie3d -1.10294% 444.namd 0% 450.soplex 0.628931% 453.povray 0.392157% 454.calculix 0% 459.GemsFDTD -1.51515% 465.tonto 0% 470.lbm -1.16279% 481.wrf -0.442478% 482.sphinx3 2.51397% SPECfp(R)_base2006 -0.769231% It doesn't help Intel Core i7. I think it should be enabled with -mtune=XXX on x86 where prefetch improves performance on XXX.
Am Samstag 19 Juni 2010, 00:07:07 schrieb H.J. Lu: > > We never tried software prefetch since Intel processors rarely need > > it. We will try it. It will take some times. > > > > Here are what we got on Intel Core i7 64bit with -O3 vs -O3 > -fprefetch-loop-arrays: > > 400.perlbench -0.369004% > 401.bzip2 -3% > 403.gcc -1.5748% > 429.mcf 0.784314% > 445.gobmk -1.28755% > 456.hmmer -1.67364% > 458.sjeng 0% > 462.libquantum 2.57827% > 464.h264ref -0.806452% > 471.omnetpp -1.51515% > 473.astar -0.581395% > 483.xalancbmk 0% > SPECint(R)_base2006 -0.766284% > 410.bwaves -1.27796% > 416.gamess -1.2605% > 433.milc 0% > 434.zeusmp 1.24481% > 435.gromacs -0.478469% > 436.cactusADM -5.07813% > 437.leslie3d -1.10294% > 444.namd 0% > 450.soplex 0.628931% > 453.povray 0.392157% > 454.calculix 0% > 459.GemsFDTD -1.51515% > 465.tonto 0% > 470.lbm -1.16279% > 481.wrf -0.442478% > 482.sphinx3 2.51397% > SPECfp(R)_base2006 -0.769231% > > It doesn't help Intel Core i7. I think it should be enabled > with -mtune=XXX on x86 where prefetch improves > performance on XXX. It also might be worth to investigate if overriding the parameters per -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did that on s390 since the default values were not ideal: e.g. we have several of these [...] if (!PARAM_SET_P (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO)) set_param_value ("prefetch-min-insn-to-mem-ratio", 2); if (!PARAM_SET_P (PARAM_SIMULTANEOUS_PREFETCHES)) set_param_value ("simultaneous-prefetches", 6); [...] in override_options. e.g. if software prefetch is expensive you could make it happen less often for core i7 or vice versa. Christian
Christian Borntraeger wrote: > It also might be worth to investigate if overriding the parameters per > -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did > that on s390 since the default values were not ideal Yes, that might be a good idea for i7. But, in the meantime, I think we should get a version of the patch that turns on prefetching on AMD CPUs with -O3. There's no reason to demand consistency for all CPUs and it clearly benefits the AMD CPUs. Changpeng, would you please submit a patch that activates this optimization only with tuning for AMD CPUs? Thanks,
Mark Mitchell <mark@codesourcery.com> writes: > > Yes, that might be a good idea for i7. > > But, in the meantime, I think we should get a version of the patch that > turns on prefetching on AMD CPUs with -O3. There's no reason to demand > consistency for all CPUs and it clearly benefits the AMD CPUs. > Changpeng, would you please submit a patch that activates this > optimization only with tuning for AMD CPUs? On i7 it will be likely still a win if hardware prefetch is disabled in the BIOS. Most BIOS have a setting like this. Unfortunately this cannot be easily detected by a program. -Andi
Hi, Mark: >> It also might be worth to investigate if overriding the parameters per >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did >> that on s390 since the default values were not ideal >Yes, that might be a good idea for i7. >But, in the meantime, I think we should get a version of the patch that >turns on prefetching on AMD CPUs with -O3. There's no reason to demand >consistency for all CPUs and it clearly benefits the AMD CPUs. >Changpeng, would you please submit a patch that activates this >optimization only with tuning for AMD CPUs? It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs. When we know it is AMD CPUS after the command line options have been parsed, we don't know whether -fno-prefetch-loop-arrays has been specified in the command line or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want. What I want is something like: if (!OPTION_SET_P (flag_prefetch_loop_arrays)) flag_prefetch_loop_arrays = 1; An alternative way is parsing -m options before other options. What do you think? Thanks, Changpeng
Am Mittwoch 23 Juni 2010, 00:05:57 schrieb Fang, Changpeng: > Hi, Mark: > > >> It also might be worth to investigate if overriding the parameters per > >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did > >> that on s390 since the default values were not ideal > > >Yes, that might be a good idea for i7. > > >But, in the meantime, I think we should get a version of the patch that > >turns on prefetching on AMD CPUs with -O3. There's no reason to demand > >consistency for all CPUs and it clearly benefits the AMD CPUs. > >Changpeng, would you please submit a patch that activates this > >optimization only with tuning for AMD CPUs? > > It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs. > When we know it is AMD CPUS after the command line options have been parsed, > we don't know whether -fno-prefetch-loop-arrays has been specified in the command line > or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want. > > What I want is something like: > if (!OPTION_SET_P (flag_prefetch_loop_arrays)) > flag_prefetch_loop_arrays = 1; I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and would match the PARAM_SET_P way of doing things. Christian
On Wed, 23 Jun 2010, Christian Borntraeger wrote: > Am Mittwoch 23 Juni 2010, 00:05:57 schrieb Fang, Changpeng: > > Hi, Mark: > > > > >> It also might be worth to investigate if overriding the parameters per > > >> -mtune=XXX results in an overall win for -fprefetch-loop-arrays. We did > > >> that on s390 since the default values were not ideal > > > > >Yes, that might be a good idea for i7. > > > > >But, in the meantime, I think we should get a version of the patch that > > >turns on prefetching on AMD CPUs with -O3. There's no reason to demand > > >consistency for all CPUs and it clearly benefits the AMD CPUs. > > >Changpeng, would you please submit a patch that activates this > > >optimization only with tuning for AMD CPUs? > > > > It seems I have a problem in setting prefetching default at -O3 only for AMD CPUs. > > When we know it is AMD CPUS after the command line options have been parsed, > > we don't know whether -fno-prefetch-loop-arrays has been specified in the command line > > or not. If we turn on prefetching at -O3, then we could not explicitly turn it off if we want. > > > > What I want is something like: > > if (!OPTION_SET_P (flag_prefetch_loop_arrays)) > > flag_prefetch_loop_arrays = 1; > > I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and > would match the PARAM_SET_P way of doing things. We use tri-states for this elsewhere in the compiler. Richard.
On Wed, 23 Jun 2010, Richard Guenther wrote: > > > What I want is something like: > > > if (!OPTION_SET_P (flag_prefetch_loop_arrays)) > > > flag_prefetch_loop_arrays = 1; > > > > I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and > > would match the PARAM_SET_P way of doing things. > > We use tri-states for this elsewhere in the compiler. We use a mixture of tri-states and separate *_set variables. When my patch series for option handling and multilib selection reaches the point of an options structure replacing separate variables for each option, I expect to use a separate such structure to record which options have been set explicitly. That is: if (!global_options_explicit.opt.prefetch_loop_arrays) global_options.opt.prefetch_loop_arrays = 1; (but hopefully using pointers to the relevant structures rather than hardcoding global_options and global_options_explicit).
Am Mittwoch 23 Juni 2010, 10:50:35 schrieb Richard Guenther: > > > What I want is something like: > > > if (!OPTION_SET_P (flag_prefetch_loop_arrays)) > > > flag_prefetch_loop_arrays = 1; > > > > I think having an OPTION_SET_P (or maybe name that FLAG_SET_P) makes a lot of sense and > > would match the PARAM_SET_P way of doing things. > > We use tri-states for this elsewhere in the compiler. So something like the following unfinished patch would be the (current) way to go? Christian Index: gcc/common.opt =================================================================== *** gcc/common.opt.orig --- gcc/common.opt *************** Common Report Var(flag_predictive_common *** 945,951 **** Run predictive commoning optimization. fprefetch-loop-arrays ! Common Report Var(flag_prefetch_loop_arrays) Optimization Generate prefetch instructions, if available, for arrays in loops fprofile --- 945,951 ---- Run predictive commoning optimization. fprefetch-loop-arrays ! Common Report Var(flag_prefetch_loop_arrays) Init(2) Optimization Generate prefetch instructions, if available, for arrays in loops fprofile Index: gcc/config/s390/s390.c =================================================================== *** gcc/config/s390/s390.c.orig --- gcc/config/s390/s390.c *************** override_options (void) *** 1676,1682 **** /* This cannot reside in optimization_options since HAVE_prefetch requires the arch flags to be evaluated already. */ ! if (HAVE_prefetch && optimize >= 3) flag_prefetch_loop_arrays = 1; } --- 1676,1682 ---- /* This cannot reside in optimization_options since HAVE_prefetch requires the arch flags to be evaluated already. */ ! if (HAVE_prefetch && optimize >= 3 && flag_prefetch_loop_arrays == 2) flag_prefetch_loop_arrays = 1; } Index: gcc/toplev.c =================================================================== *** gcc/toplev.c.orig --- gcc/toplev.c *************** process_options (void) *** 2048,2053 **** --- 2048,2056 ---- } #ifndef HAVE_prefetch + if (flag_prefetch_loop_arrays == 2) + flag_prefetch_loop_arrays = 0; + if (flag_prefetch_loop_arrays) { warning (0, "-fprefetch-loop-arrays not supported for this target");
> > fprefetch-loop-arrays > ! Common Report Var(flag_prefetch_loop_arrays) Init(2) Optimization For consistency with other options, I would suggest to use Init(-1) for an "uninitialized" option. As Joseph said, this should be done differently in the future, but for now using -1 is the usual setting. I didn't understand how (or if) Joseph will implement the framework to set default values. As I understand it, Init(X) should only set the value X to the Var if it is not set by anything else before. Then, we can move the default values back to the *.opt files. Otherwise, such a patch would be an improvement. Cheers, Manuel.
From 50ef9b1dd700ace9854f88814488b7807fcbae1c Mon Sep 17 00:00:00 2001 From: Changpeng Fang <chfang@pathscale.(none)> Date: Thu, 10 Jun 2010 14:52:15 -0700 Subject: [PATCH 2/2] Enable -fprefetch-loop-arrays under -O3 *opts.c (decode_options): Turn on flag_prefetch_loop_arrays under -O3. *doc/invoke.texi(@item -O3): Say that -O3 includes -fprefetch-loop-arrays. (@item -fprefetch-loop-arrays): Say that -fprefetch-loop-arrays is enabled by default under -O3. --- gcc/doc/invoke.texi | 6 ++++-- gcc/opts.c | 1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index d8c0c22..1c28a91 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -5883,7 +5883,8 @@ invoking @option{-O2} on programs that use computed gotos. Optimize yet more. @option{-O3} turns on all optimizations specified by @option{-O2} and also turns on the @option{-finline-functions}, @option{-funswitch-loops}, @option{-fpredictive-commoning}, -@option{-fgcse-after-reload} and @option{-ftree-vectorize} options. +@option{-fgcse-after-reload}, @option{-ftree-vectorize} and +@option{-fprefetch-loop-arrays} options. @item -O0 @opindex O0 @@ -7037,7 +7038,8 @@ memory to improve the performance of loops that access large arrays. This option may generate better or worse code; results are highly dependent on the structure of loops within the source code. -Disabled at level @option{-Os}. +This flag is enabled by default at @option{-O3} and disabled at +level @option{-Os}. @item -fno-peephole @itemx -fno-peephole2 diff --git a/gcc/opts.c b/gcc/opts.c index 8699ec3..7814341 100644 --- a/gcc/opts.c +++ b/gcc/opts.c @@ -951,6 +951,7 @@ decode_options (unsigned int argc, const char **argv) flag_unswitch_loops = opt3; flag_gcse_after_reload = opt3; flag_tree_vectorize = opt3; + flag_prefetch_loop_arrays = opt3; flag_ipa_cp_clone = opt3; if (flag_ipa_cp_clone) flag_ipa_cp = 1; -- 1.6.3.3