Message ID: 20240606101043.3682477-1-manolis.tsamis@vrull.eu
State: New
Series: [v2] Target-independent store forwarding avoidance.
On 6/6/24 4:10 AM, Manolis Tsamis wrote:
> This pass detects cases of expensive store forwarding and tries to avoid them
> by reordering the stores and using suitable bit insertion sequences.
> For example it can transform this:
>
>      strb w2, [x1, 1]
>      ldr x0, [x1]    # Expensive store forwarding to larger load.
>
> To:
>
>      ldr x0, [x1]
>      strb w2, [x1]
>      bfi x0, x2, 0, 8
>
> Assembly like this can appear with bitfields or type punning / unions.
> On stress-ng when running the cpu-union microbenchmark the following speedups
> have been observed.
>
>      Neoverse-N1:      +29.4%
>      Intel Coffeelake: +13.1%
>      AMD 5950X:        +17.5%
>
> gcc/ChangeLog:
>
> 	* Makefile.in: Add avoid-store-forwarding.o.
> 	* common.opt: New option -favoid-store-forwarding.
> 	* params.opt: New param store-forwarding-max-distance.
> 	* doc/invoke.texi: Document new pass.
> 	* doc/passes.texi: Document new pass.
> 	* passes.def: Schedule a new pass.
> 	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> 	* avoid-store-forwarding.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.

So this is getting a lot more interesting.  I think the first time I looked
at this it was more concerned with stores feeding something like a load-pair
and avoiding the store forwarding penalty for that case.  Am I
mis-remembering, or did it get significantly more general?

> +
> +static unsigned int stats_sf_detected = 0;
> +static unsigned int stats_sf_avoided = 0;
> +
> +static rtx
> +get_load_mem (rtx expr)

Needs a function comment.  You should probably mention that EXPR must be a
single_set in that comment.
> +
> +  rtx dest;
> +  if (eliminate_load)
> +    dest = gen_reg_rtx (load_inner_mode);
> +  else
> +    dest = SET_DEST (load);
> +
> +  int move_to_front = -1;
> +  int total_cost = 0;
> +
> +  /* Check if we can emit bit insert instructions for all forwarded stores.  */
> +  FOR_EACH_VEC_ELT (stores, i, it)
> +    {
> +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> +      rtx_insn *insns = NULL;
> +
> +      /* If we're eliminating the load then find the store with zero offset
> +	 and use it as the base register to avoid a bit insert.  */
> +      if (eliminate_load && it->offset == 0)

So how often is this triggering?  We have various bits of code in the gimple
optimizers to detect a store followed by a load from the same address and do
the forwarding.  If these cases are happening with any frequency, that would
be a good sign the code in DOM and elsewhere isn't working well.

The way those passes detect this case is to take the store, flip the
operands around (i.e., so it looks like a load) and enter that into the
expression hash tables.  After that, standard redundancy elimination
approaches will work.

> +	{
> +	  start_sequence ();
> +
> +	  /* We can use a paradoxical subreg to force this to a wider mode, as
> +	     the only use will be inserting the bits (i.e., we don't care about
> +	     the value of the higher bits).  */

Which may be a good hint about the cases you're capturing -- if the
modes/sizes differ, that would make more sense, since I don't think we're as
likely to be capturing those cases.

> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 4e8967fd8ab..c769744d178 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -12657,6 +12657,15 @@ loop unrolling.
>  This option is enabled by default at optimization levels @option{-O1},
>  @option{-O2}, @option{-O3}, @option{-Os}.
>
> +@opindex favoid-store-forwarding
> +@item -favoid-store-forwarding
> +@itemx -fno-avoid-store-forwarding
> +Many CPUs will stall for many cycles when a load partially depends on
> +previous smaller stores.
> +This pass tries to detect such cases and avoid the penalty by changing
> +the order of the load and store and then fixing up the loaded value.
> +
> +Disabled by default.

Is there any particular reason why this would be off by default at -O1 or
higher?  It would seem to me that on modern cores this transformation should
easily be a win.  Even on an old in-order core, avoiding the load with the
bit insert is likely profitable, just not as much so.

> diff --git a/gcc/params.opt b/gcc/params.opt
> index d34ef545bf0..b8115f5c27a 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
>  Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
>  Maximum size of a single store merging region in bytes.
>
> +-param=store-forwarding-max-distance=
> +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
> +Maximum number of instruction distance that a small store forwarded to a larger load may stall.

I think you may need to run the update-urls script since you've added a new
option.

In general it seems pretty reasonable.

I've actually added it to my tester just to see if there's any fallout.
It'll take a week to churn through the long running targets that bootstrap
in QEMU, but the crosses should have data Monday.

jeff
On 6/7/24 4:31 PM, Jeff Law wrote:
>
> I've actually added it to my tester just to see if there's any fallout.
> It'll take a week to churn through the long running targets that
> bootstrap in QEMU, but the crosses should have data Monday.

The first round naturally didn't trigger anything because the option is off
by default.  So I twiddled it to be on at -O1 and above.

epiphany-elf ICEs in gen_rtx_SUBREG with the attached .i file compiled with
-O2:

> root@577c7458c93a://home/jlaw/jenkins/workspace/epiphany-elf/epiphany-elf-obj/newlib/epiphany-elf/newlib/libm/complex# epiphany-elf-gcc -O2 libm_a-cacos.i
> during RTL pass: avoid_store_forwarding
> ../../../..//newlib-cygwin/newlib/libm/complex/cacos.c: In function 'cacos':
> ../../../..//newlib-cygwin/newlib/libm/complex/cacos.c:99:1: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.cc:1032
> 0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>)
> 	../../..//gcc/gcc/emit-rtl.cc:1032
> 0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>)
> 	../../..//gcc/gcc/emit-rtl.cc:1030
> 0xe82216 process_forwardings
> 	../../..//gcc/gcc/avoid-store-forwarding.cc:273
> 0xe82216 avoid_store_forwarding
> 	../../..//gcc/gcc/avoid-store-forwarding.cc:489
> 0xe82667 execute
> 	../../..//gcc/gcc/avoid-store-forwarding.cc:558

ft32-elf ICE'd in bitmap_check_index at various optimization levels:

> FAIL: execute/pr108498-2.c   -O1  (internal compiler error: in bitmap_check_index, at sbitmap.h:104)
> FAIL: execute/pr108498-2.c   -O1  (test for excess errors)
> FAIL: execute/pr108498-2.c   -O2  (internal compiler error: in bitmap_check_index, at sbitmap.h:104)
> FAIL: execute/pr108498-2.c   -O2  (test for excess errors)
> FAIL: execute/pr108498-2.c   -O3 -g  (internal compiler error: in bitmap_check_index, at sbitmap.h:104)
> FAIL: execute/pr108498-2.c   -O3 -g  (test for excess errors)

avr, c6x, and lm32-elf failed to build libgcc with an ICE in
leaf_function_p; I haven't isolated that yet.
There were other failures as well.  But you've got a few to start with and
we can retest pretty easily as the patch evolves.

jeff

[Attachment: libm_a-cacos.i -- preprocessed source for newlib's
libm/complex/cacos.c, the epiphany-elf reproducer.  The bulk is standard
newlib/stddef header output; the function that triggers the ICE is:]

double _Complex
cacos(double _Complex z)
{
	double _Complex w;
	double _Complex tmp0, tmp1;

	tmp0 = casin(z);
	tmp1 = 1.57079632679489661923 - creal(tmp0);
	w = tmp1 - (cimag(tmp0) * 1.0fi);
	return w;
}
On Fri, 7 Jun 2024, Jeff Law wrote: > > > On 6/6/24 4:10 AM, Manolis Tsamis wrote: > > This pass detects cases of expensive store forwarding and tries to avoid > > them > > by reordering the stores and using suitable bit insertion sequences. > > For example it can transform this: > > > > strb w2, [x1, 1] > > ldr x0, [x1] # Expensive store forwarding to larger load. > > > > To: > > > > ldr x0, [x1] > > strb w2, [x1] > > bfi x0, x2, 0, 8 > > > > Assembly like this can appear with bitfields or type punning / unions. > > On stress-ng when running the cpu-union microbenchmark the following > > speedups > > have been observed. > > > > Neoverse-N1: +29.4% > > Intel Coffeelake: +13.1% > > AMD 5950X: +17.5% > > > > gcc/ChangeLog: > > > > * Makefile.in: Add avoid-store-forwarding.o. > > * common.opt: New option -favoid-store-forwarding. > > * params.opt: New param store-forwarding-max-distance. > > * doc/invoke.texi: Document new pass. > > * doc/passes.texi: Document new pass. > > * passes.def: Schedule a new pass. > > * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare. > > * avoid-store-forwarding.cc: New file. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/avoid-store-forwarding-1.c: New test. > > * gcc.target/aarch64/avoid-store-forwarding-2.c: New test. > > * gcc.target/aarch64/avoid-store-forwarding-3.c: New test. > > * gcc.target/aarch64/avoid-store-forwarding-4.c: New test. > So this is getting a lot more interesting. I think the first time I looked at > this it was more concerned with stores feeding something like a load-pair and > avoiding the store forwarding penalty for that case. Am I mis-remembering, or > did it get significantly more general? > > > > > > > + > > +static unsigned int stats_sf_detected = 0; > > +static unsigned int stats_sf_avoided = 0; > > + > > +static rtx > > +get_load_mem (rtx expr) > Needs a function comment. You should probably mention that EXPR must be a > single_set in that comment. 
> > > > + > > + rtx dest; > > + if (eliminate_load) > > + dest = gen_reg_rtx (load_inner_mode); > > + else > > + dest = SET_DEST (load); > > + > > + int move_to_front = -1; > > + int total_cost = 0; > > + > > + /* Check if we can emit bit insert instructions for all forwarded stores. > > */ > > + FOR_EACH_VEC_ELT (stores, i, it) > > + { > > + it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem)); > > + rtx_insn *insns = NULL; > > + > > + /* If we're eliminating the load then find the store with zero offset > > + and use it as the base register to avoid a bit insert. */ > > + if (eliminate_load && it->offset == 0) > So often is this triggering? We have various codes in the gimple optimizers > to detect store followed by a load from the same address and do the > forwarding. If they're happening with any frequency that would be a good sign > code in DOM and elsewhere isn't working well. > > THe way these passes detect this case is to take store, flip the operands > around (ie, it looks like a load) and enter that into the expression hash > tables. After that standard redundancy elimination approaches will work. > > > > + { > > + start_sequence (); > > + > > + /* We can use a paradoxical subreg to force this to a wider mode, as > > + the only use will be inserting the bits (i.e., we don't care > > about > > + the value of the higher bits). */ > Which may be a good hint about the cases you're capturing -- if the > modes/sizes differ that would make more sense since I don't think we're as > likely to be capturing those cases. Yeah, we handle stores from constants quite well and FRE can forward stores from SSA names to smaller loads by inserting BIT_FIELD_REFs but it does have some additional restrictions. > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > index 4e8967fd8ab..c769744d178 100644 > > --- a/gcc/doc/invoke.texi > > +++ b/gcc/doc/invoke.texi > > @@ -12657,6 +12657,15 @@ loop unrolling. 
> > This option is enabled by default at optimization levels @option{-O1}, > > @option{-O2}, @option{-O3}, @option{-Os}. > > > > +@opindex favoid-store-forwarding > > +@item -favoid-store-forwarding > > +@itemx -fno-avoid-store-forwarding > > +Many CPUs will stall for many cycles when a load partially depends on > > previous > > +smaller stores. This pass tries to detect such cases and avoid the penalty > > by > > +changing the order of the load and store and then fixing up the loaded > > value. > > + > > +Disabled by default. > Is there any particular reason why this would be off by default at -O1 or > higher? It would seem to me that on modern cores that this transformation > should easily be a win. Even on an old in-order core, avoiding the load with > the bit insert is likely profitable, just not as much so. I would think it's up to the targets to decide on a default. > > diff --git a/gcc/params.opt b/gcc/params.opt > > index d34ef545bf0..b8115f5c27a 100644 > > --- a/gcc/params.opt > > +++ b/gcc/params.opt > > @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned > > stores if it is legal to do > > Common Joined UInteger Var(param_store_merging_max_size) Init(65536) > > IntegerRange(1, 65536) Param Optimization > > Maximum size of a single store merging region in bytes. > > > > +-param=store-forwarding-max-distance= > > +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) > > IntegerRange(1, 1000) Param Optimization > > +Maximum number of instruction distance that a small store forwarded to a > > larger load may stall. > I think you may need to run the update-urls script since you've added a new > option. > > > In general it seems pretty reasonable. > > I've actually added it to my tester just to see if there's any fallout. It'll > take a week to churn through the long running targets that bootstrap in QEMU, > but the crosses should have data Monday. > > jeff > > >
On Sat, Jun 8, 2024 at 1:31 AM Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > On 6/6/24 4:10 AM, Manolis Tsamis wrote: > > This pass detects cases of expensive store forwarding and tries to avoid them > > by reordering the stores and using suitable bit insertion sequences. > > For example it can transform this: > > > > strb w2, [x1, 1] > > ldr x0, [x1] # Expensive store forwarding to larger load. > > > > To: > > > > ldr x0, [x1] > > strb w2, [x1] > > bfi x0, x2, 0, 8 > > > > Assembly like this can appear with bitfields or type punning / unions. > > On stress-ng when running the cpu-union microbenchmark the following speedups > > have been observed. > > > > Neoverse-N1: +29.4% > > Intel Coffeelake: +13.1% > > AMD 5950X: +17.5% > > > > gcc/ChangeLog: > > > > * Makefile.in: Add avoid-store-forwarding.o. > > * common.opt: New option -favoid-store-forwarding. > > * params.opt: New param store-forwarding-max-distance. > > * doc/invoke.texi: Document new pass. > > * doc/passes.texi: Document new pass. > > * passes.def: Schedule a new pass. > > * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare. > > * avoid-store-forwarding.cc: New file. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/avoid-store-forwarding-1.c: New test. > > * gcc.target/aarch64/avoid-store-forwarding-2.c: New test. > > * gcc.target/aarch64/avoid-store-forwarding-3.c: New test. > > * gcc.target/aarch64/avoid-store-forwarding-4.c: New test. > So this is getting a lot more interesting. I think the first time I > looked at this it was more concerned with stores feeding something like > a load-pair and avoiding the store forwarding penalty for that case. Am > I mis-remembering, or did it get significantly more general? > There was an older submission of a load-pair specific pass but this is a complete reimplementation and indeed significantly more general. 
Apart from being target independent, it addresses a number of important restrictions and can handle multiple store forwardings per load. It should be noted that it cannot handle the load-pair cases as these need special handling, but that's something we're planning to do in the future by reusing this infrastructure. > > > > > > + > > +static unsigned int stats_sf_detected = 0; > > +static unsigned int stats_sf_avoided = 0; > > + > > +static rtx > > +get_load_mem (rtx expr) > Needs a function comment. You should probably mention that EXPR must be > a single_set in that comment. > Ok, will do. > > > + > > + rtx dest; > > + if (eliminate_load) > > + dest = gen_reg_rtx (load_inner_mode); > > + else > > + dest = SET_DEST (load); > > + > > + int move_to_front = -1; > > + int total_cost = 0; > > + > > + /* Check if we can emit bit insert instructions for all forwarded stores. */ > > + FOR_EACH_VEC_ELT (stores, i, it) > > + { > > + it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem)); > > + rtx_insn *insns = NULL; > > + > > + /* If we're eliminating the load then find the store with zero offset > > + and use it as the base register to avoid a bit insert. */ > > + if (eliminate_load && it->offset == 0) > So often is this triggering? We have various codes in the gimple > optimizers to detect store followed by a load from the same address and > do the forwarding. If they're happening with any frequency that would > be a good sign code in DOM and elsewhere isn't working well. > > The way these passes detect this case is to take the store, flip the > operands around (ie, it looks like a load) and enter that into the > expression hash tables. After that standard redundancy elimination > approaches will work. > This condition doesn't handle this case (single store forwarding to same-sized load) but rather any number of smaller (non-overlapping) stores that cover the bytes of the load and hence the load can be eliminated.
As the comment at the top mentions: If the stores cover all the bytes of the load without overlap then we can eliminate the load entirely and use the computed value instead. The specific condition (eliminate_load && it->offset == 0) is there to just find the store that doesn't need an offset relative to the load and use it as a "base" for the rest of the bit-insert sequences. Originally we didn't special-case this and used an uninitialized base register and then N bit-insert sequences but the result was usually suboptimal (e.g. R = zero, R = bit-insert X1, R = bit-insert X2). So if we know that the load will be eliminated we can choose one of the values as base instead and then do N - 1 subsequent bit-inserts (i.e. R = X1, R = bit-insert X2). > > > + { > > + start_sequence (); > > + > > + /* We can use a paradoxical subreg to force this to a wider mode, as > > + the only use will be inserting the bits (i.e., we don't care about > > + the value of the higher bits). */ > Which may be a good hint about the cases you're capturing -- if the > modes/sizes differ that would make more sense since I don't think we're > as likely to be capturing those cases. > Right, that's the case. > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > index 4e8967fd8ab..c769744d178 100644 > > --- a/gcc/doc/invoke.texi > > +++ b/gcc/doc/invoke.texi > > @@ -12657,6 +12657,15 @@ loop unrolling. > > This option is enabled by default at optimization levels @option{-O1}, > > @option{-O2}, @option{-O3}, @option{-Os}. > > > > +@opindex favoid-store-forwarding > > +@item -favoid-store-forwarding > > +@itemx -fno-avoid-store-forwarding > > +Many CPUs will stall for many cycles when a load partially depends on previous > > +smaller stores. This pass tries to detect such cases and avoid the penalty by > > +changing the order of the load and store and then fixing up the loaded value. > > + > > +Disabled by default. 
> Is there any particular reason why this would be off by default at -O1 > or higher? It would seem to me that on modern cores that this > transformation should easily be a win. Even on an old in-order core, > avoiding the load with the bit insert is likely profitable, just not as > much so. > I don't have a strong opinion on that, but I believe Richard's suggestion to decide this on a per-target basis also makes a lot of sense. Deciding whether the transformation is profitable is tightly tied to the architecture in question (i.e. how large the stall is and what sort of bit-insert instructions are available). In order to make this more widely applicable, I think we'll need a target hook that decides in which cases the forwarded stores incur a penalty and thus the transformation makes sense. Afaik, for each CPU there may be cases where store forwarding is handled efficiently. > > > > > diff --git a/gcc/params.opt b/gcc/params.opt > > index d34ef545bf0..b8115f5c27a 100644 > > --- a/gcc/params.opt > > +++ b/gcc/params.opt > > @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do > > Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization > > Maximum size of a single store merging region in bytes. > > > > +-param=store-forwarding-max-distance= > > +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization > > +Maximum number of instruction distance that a small store forwarded to a larger load may stall. > I think you may need to run the update-urls script since you've added a > new option. > Ok, thanks for the hint. > > In general it seems pretty reasonable. > > I've actually added it to my tester just to see if there's any fallout. > It'll take a week to churn through the long running targets that > bootstrap in QEMU, but the crosses should have data Monday. > Great, thanks! Manolis > jeff >
On Sun, Jun 9, 2024 at 5:29 PM Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > On 6/7/24 4:31 PM, Jeff Law wrote: > > > > > I've actually added it to my tester just to see if there's any fallout. > > It'll take a week to churn through the long running targets that > > bootstrap in QEMU, but the crosses should have data Monday. > The first round naturally didn't trigger anything because the option is > off by default. So I twiddled it to be on at -O1 and above. > > epiphany-elf ICEs in gen_rtx_SUBREG with the attached .i file compiled > with -O2: > > > root@577c7458c93a://home/jlaw/jenkins/workspace/epiphany-elf/epiphany-elf-obj/newlib/epiphany-elf/newlib/libm/complex# epiphany-elf-gcc -O2 libm_a-cacos.i > > during RTL pass: avoid_store_forwarding > > ../../../..//newlib-cygwin/newlib/libm/complex/cacos.c: In function 'cacos': > > ../../../..//newlib-cygwin/newlib/libm/complex/cacos.c:99:1: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.cc:1032 > > 0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>) > > ../../..//gcc/gcc/emit-rtl.cc:1032 > > 0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>) > > ../../..//gcc/gcc/emit-rtl.cc:1030 > > 0xe82216 process_forwardings > > ../../..//gcc/gcc/avoid-store-forwarding.cc:273 > > 0xe82216 avoid_store_forwarding > > ../../..//gcc/gcc/avoid-store-forwarding.cc:489 > > 0xe82667 execute > > ../../..//gcc/gcc/avoid-store-forwarding.cc:558 > > > ft32-elf ICE'd in bitmap_check_index at various optimization levels: > > > FAIL: execute/pr108498-2.c -O1 (internal compiler error: in bitmap_check_index, at sbitmap.h:104) > > FAIL: execute/pr108498-2.c -O1 (test for excess errors) > > FAIL: execute/pr108498-2.c -O2 (internal compiler error: in bitmap_check_index, at sbitmap.h:104) > > FAIL: execute/pr108498-2.c -O2 (test for excess errors) > > FAIL: execute/pr108498-2.c -O3 -g (internal compiler error: in bitmap_check_index, at sbitmap.h:104) > > FAIL: execute/pr108498-2.c -O3 -g 
(test for excess errors) > > > avr, c6x, > > lm32-elf failed to build libgcc with an ICE in leaf_function_p, I > haven't isolated that yet. > > > There were other failures as well. But you've got a few to start with > and we can retest pretty easily as the patch evolves. > Indeed ;) Thanks for reporting these! Manolis > jeff >
On 6/10/24 1:55 AM, Manolis Tsamis wrote: >> > There was an older submission of a load-pair specific pass but this is > a complete reimplementation and indeed significantly more general. > Apart from being target independant, it addresses a number of > important restrictions and can handle multiple store forwardings per > load. > It should be noted that it cannot handle the load-pair cases as these > need special handling, but that's something we're planning to do in > the future by reusing this infrastructure. ACK. Thanks for the additional background. > >> >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi >>> index 4e8967fd8ab..c769744d178 100644 >>> --- a/gcc/doc/invoke.texi >>> +++ b/gcc/doc/invoke.texi >>> @@ -12657,6 +12657,15 @@ loop unrolling. >>> This option is enabled by default at optimization levels @option{-O1}, >>> @option{-O2}, @option{-O3}, @option{-Os}. >>> >>> +@opindex favoid-store-forwarding >>> +@item -favoid-store-forwarding >>> +@itemx -fno-avoid-store-forwarding >>> +Many CPUs will stall for many cycles when a load partially depends on previous >>> +smaller stores. This pass tries to detect such cases and avoid the penalty by >>> +changing the order of the load and store and then fixing up the loaded value. >>> + >>> +Disabled by default. >> Is there any particular reason why this would be off by default at -O1 >> or higher? It would seem to me that on modern cores that this >> transformation should easily be a win. Even on an old in-order core, >> avoiding the load with the bit insert is likely profitable, just not as >> much so. >> > I don't have a strong opinion for that but I believe Richard's > suggestion to decide this on a per-target basis also makes a lot of > sense. > Deciding whether the transformation is profitable is tightly tied to > the architecture in question (i.e. how large the stall is and what > sort of bit-insert instructions are available). 
> In order to make this more widely applicable, I think we'll need a > target hook that decides in which case the forwarded stores incur a > penalty and thus the transformation makes sense. You and Richi are probably right. I'm not a big fan of passes being enabled/disabled on particular targets, but it may make sense here. > Afaik, for each CPU there may be cases that store forwarding is > handled efficiently. Absolutely. But forwarding from a smaller store to a wider load is painful from a hardware standpoint and if we can avoid it from a codegen standpoint, we should. Did y'all look at spec2017 at all for this patch? I've got our hardware guys to expose a signal for this case so that we can (in a month or so) get some hard data on how often it's happening in spec2017 and evaluate how this patch helps the most affected workloads. But if y'all already have some data we can use it as a starting point. jeff
On Mon, 10 Jun 2024 at 20:03, Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > On 6/10/24 1:55 AM, Manolis Tsamis wrote: > > >> > > There was an older submission of a load-pair specific pass but this is > > a complete reimplementation and indeed significantly more general. > > Apart from being target independant, it addresses a number of > > important restrictions and can handle multiple store forwardings per > > load. > > It should be noted that it cannot handle the load-pair cases as these > > need special handling, but that's something we're planning to do in > > the future by reusing this infrastructure. > ACK. Thanks for the additional background. > > > > > >> > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > >>> index 4e8967fd8ab..c769744d178 100644 > >>> --- a/gcc/doc/invoke.texi > >>> +++ b/gcc/doc/invoke.texi > >>> @@ -12657,6 +12657,15 @@ loop unrolling. > >>> This option is enabled by default at optimization levels @option{-O1}, > >>> @option{-O2}, @option{-O3}, @option{-Os}. > >>> > >>> +@opindex favoid-store-forwarding > >>> +@item -favoid-store-forwarding > >>> +@itemx -fno-avoid-store-forwarding > >>> +Many CPUs will stall for many cycles when a load partially depends on previous > >>> +smaller stores. This pass tries to detect such cases and avoid the penalty by > >>> +changing the order of the load and store and then fixing up the loaded value. > >>> + > >>> +Disabled by default. > >> Is there any particular reason why this would be off by default at -O1 > >> or higher? It would seem to me that on modern cores that this > >> transformation should easily be a win. Even on an old in-order core, > >> avoiding the load with the bit insert is likely profitable, just not as > >> much so. > >> > > I don't have a strong opinion for that but I believe Richard's > > suggestion to decide this on a per-target basis also makes a lot of > > sense. 
> > Deciding whether the transformation is profitable is tightly tied to > > the architecture in question (i.e. how large the stall is and what > > sort of bit-insert instructions are available). > > In order to make this more widely applicable, I think we'll need a > > target hook that decides in which case the forwarded stores incur a > > penalty and thus the transformation makes sense. > You and Richi are probably right. I'm not a big fan of passes being > enabled/disabled on particular targets, but it may make sense here. > > > > > Afaik, for each CPU there may be cases that store forwarding is > > handled efficiently. > Absolutely. But forwarding from a smaller store to a wider load is > painful from a hardware standpoint and if we can avoid it from a codegen > standpoint, we should. This change is what I briefly hinted as "the complete solution" that we had on the drawing board when we briefly talked last November in Santa Clara. > Did y'all look at spec2017 at all for this patch? I've got our hardware > guys to expose a signal for this case so that we can (in a month or so) > get some hard data on how often it's happening in spec2017 and evaluate > how this patch helps the most affected workloads. But if y'all already > have some data we can use it as a starting point. We have looked at all of SPEC2017, especially for coverage (i.e., making sure we see a significant number of uses of the transformation) and correctness. The gcc_r and parest_r components triggered in a number of "interesting" ways (e.g., motivating the case of load-elimination). If it helps, we could share the statistics for how often the pass triggers on compiling each of the SPEC2017 components? Philipp.
On 6/10/24 12:27 PM, Philipp Tomsich wrote: > > This change is what I briefly hinted as "the complete solution" that > we had on the drawing board when we briefly talked last November in > Santa Clara. I haven't any recollection of that part of the discussion, but I was a bit frazzled as you probably noticed. > We have looked at all of SPEC2017, especially for coverage (i.e., > making sure we see a significant number of uses of the transformation) > and correctness. The gcc_r and parest_r components triggered in a > number of "interesting" ways (e.g., motivating the case of > load-elimination). If it helps, we could share the statistics for how > often the pass triggers on compiling each of the SPEC2017 components? Definitely helpful. I may be able to juggle some priorities internally to lend a larger hand on testing and helping move this forward. It's an area we're definitely interested in. Jeff
On Mon, 10 Jun 2024, Jeff Law wrote: > > > On 6/10/24 1:55 AM, Manolis Tsamis wrote: > > >> > > There was an older submission of a load-pair specific pass but this is > > a complete reimplementation and indeed significantly more general. > > Apart from being target independant, it addresses a number of > > important restrictions and can handle multiple store forwardings per > > load. > > It should be noted that it cannot handle the load-pair cases as these > > need special handling, but that's something we're planning to do in > > the future by reusing this infrastructure. > ACK. Thanks for the additional background. > > > > > >> > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > >>> index 4e8967fd8ab..c769744d178 100644 > >>> --- a/gcc/doc/invoke.texi > >>> +++ b/gcc/doc/invoke.texi > >>> @@ -12657,6 +12657,15 @@ loop unrolling. > >>> This option is enabled by default at optimization levels @option{-O1}, > >>> @option{-O2}, @option{-O3}, @option{-Os}. > >>> > >>> +@opindex favoid-store-forwarding > >>> +@item -favoid-store-forwarding > >>> +@itemx -fno-avoid-store-forwarding > >>> +Many CPUs will stall for many cycles when a load partially depends on > >>> previous > >>> +smaller stores. This pass tries to detect such cases and avoid the > >>> penalty by > >>> +changing the order of the load and store and then fixing up the loaded > >>> value. > >>> + > >>> +Disabled by default. > >> Is there any particular reason why this would be off by default at -O1 > >> or higher? It would seem to me that on modern cores that this > >> transformation should easily be a win. Even on an old in-order core, > >> avoiding the load with the bit insert is likely profitable, just not as > >> much so. > >> > > I don't have a strong opinion for that but I believe Richard's > > suggestion to decide this on a per-target basis also makes a lot of > > sense. > > Deciding whether the transformation is profitable is tightly tied to > > the architecture in question (i.e. 
how large the stall is and what > > sort of bit-insert instructions are available). > > In order to make this more widely applicable, I think we'll need a > > target hook that decides in which case the forwarded stores incur a > > penalty and thus the transformation makes sense. > You and Richi are probably right. I'm not a big fan of passes being > enabled/disabled on particular targets, but it may make sense here. > > > > > Afaik, for each CPU there may be cases that store forwarding is > > handled efficiently. > Absolutely. But forwarding from a smaller store to a wider load is painful > from a hardware standpoint and if we can avoid it from a codegen standpoint, > we should. Note there's also the possibility to increase the distance between the store and the load - in fact the time a store takes to a) retire and b) get from the store buffers to where the load-store unit would pick it up (L1-D) is another target specific tuning knob. That said, if that distance isn't too large (on x86 there might be only an upper bound given by the OOO window size and the L1D store latency(?), possibly also additionally by the store buffer size) attacking the issue in sched1 or sched2 might be another possibility. So I think pass placement is another thing to look at - I'd definitely place it after sched1 but I guess without looking at the pass again it's way before that? Richard. > Did y'all look at spec2017 at all for this patch? I've got our hardware guys > to expose a signal for this case so that we can (in a month or so) get some > hard data on how often it's happening in spec2017 and evaluate how this patch > helps the most affected workloads. But if y'all already have some data we can > use it as a starting point. > > jeff
On 6/11/24 1:22 AM, Richard Biener wrote: >> Absolutely. But forwarding from a smaller store to a wider load is painful >> from a hardware standpoint and if we can avoid it from a codegen standpoint, >> we should. > > Note there's also the possibility to increase the distance between the > store and the load - in fact the time a store takes to a) retire and > b) get from the store buffers to where the load-store unit would pick it > up (L1-D) is another target specific tuning knob. That said, if that > distance isn't too large (on x86 there might be only an upper bound > given by the OOO window size and the L1D store latency(?), possibly > also additionally by the store buffer size) attacking the issue in > sched1 or sched2 might be another possibility. So I think pass placement > is another thing to look at - I'd definitely place it after sched1 > but I guess without looking at the pass again it's way before that? True, but I doubt there are enough instructions we could sink the load past to make a measurable difference. This is especially true on the class of uarchs where this is going to be most important. In the case where the store/load can't be interchanged and thus this new pass rejects any transformation, we could try to do something in the scheduler to defer the load as long as possible. Essentially it's a true dependency through a memory location using must-aliasing properties and in that case we'd want to crank up the "latency" of the store so that the load gets pushed away. I think one of the difficulties here is we often model stores as not having any latency (which is probably OK in most cases). Input data dependencies and structural hazards dominate considerations for stores. jeff
On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > On 6/11/24 1:22 AM, Richard Biener wrote: > > >> Absolutely. But forwarding from a smaller store to a wider load is painful > >> from a hardware standpoint and if we can avoid it from a codegen standpoint, > >> we should. > > > > Note there's also the possibility to increase the distance between the > > store and the load - in fact the time a store takes to a) retire and > > b) get from the store buffers to where the load-store unit would pick it > > up (L1-D) is another target specific tuning knob. That said, if that > > distance isn't too large (on x86 there might be only an upper bound > > given by the OOO window size and the L1D store latency(?), possibly > > also additionally by the store buffer size) attacking the issue in > > sched1 or sched2 might be another possibility. So I think pass placement > > is another thing to look at - I'd definitely place it after sched1 > > but I guess without looking at the pass again it's way before that? > True, but I doubt there are enough instructions we could sink the load > past to make a measurable difference. This is especially true on the > class of uarchs where this is going to be most important. > > In the case where the store/load can't be interchanged and thus this new > pass rejects any transformation, we could try to do something in the > scheduler to defer the load as long as possible. Essentially it's a > true dependency through a memory location using must-aliasing properties > and in that case we'd want to crank up the "latency" of the store so > that the load gets pushed away. > > I think one of the difficulties here is we often model stores as not > having any latency (which is probably OK in most cases). Input data > dependencies and structural hazards dominate dominate considerations for > stores. I don't think that TARGET_SCHED_ADJUST_COST would even be called for a data-dependence through a memory location. 
Note that, strictly speaking, the store does not have an extended latency; it will be the load that will have an increased latency (almost as if we knew that the load will miss to one of the outer points-of-coherence). The difference being that the load would not hang around in a scheduling queue until being dispatched, but its execution would start immediately and take more cycles (and potentially block an execution pipeline for longer). Philipp.
On 6/11/24 7:52 AM, Philipp Tomsich wrote: > On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreyalaw@gmail.com> wrote: >> >> >> >> On 6/11/24 1:22 AM, Richard Biener wrote: >> >>>> Absolutely. But forwarding from a smaller store to a wider load is painful >>>> from a hardware standpoint and if we can avoid it from a codegen standpoint, >>>> we should. >>> >>> Note there's also the possibility to increase the distance between the >>> store and the load - in fact the time a store takes to a) retire and >>> b) get from the store buffers to where the load-store unit would pick it >>> up (L1-D) is another target specific tuning knob. That said, if that >>> distance isn't too large (on x86 there might be only an upper bound >>> given by the OOO window size and the L1D store latency(?), possibly >>> also additionally by the store buffer size) attacking the issue in >>> sched1 or sched2 might be another possibility. So I think pass placement >>> is another thing to look at - I'd definitely place it after sched1 >>> but I guess without looking at the pass again it's way before that? >> True, but I doubt there are enough instructions we could sink the load >> past to make a measurable difference. This is especially true on the >> class of uarchs where this is going to be most important. >> >> In the case where the store/load can't be interchanged and thus this new >> pass rejects any transformation, we could try to do something in the >> scheduler to defer the load as long as possible. Essentially it's a >> true dependency through a memory location using must-aliasing properties >> and in that case we'd want to crank up the "latency" of the store so >> that the load gets pushed away. >> >> I think one of the difficulties here is we often model stores as not >> having any latency (which is probably OK in most cases). Input data >> dependencies and structural hazards dominate dominate considerations for >> stores. 
> > I don't think that TARGET_SCHED_ADJUST_COST would even be called for a > data-dependence through a memory location. Probably correct, but we could adjust that behavior or add another mechanism to adjust costs based on memory dependencies. > > Note that, strictly speaking, the store does not have an extended > latency; it will be the load that will have an increased latency > (almost as if we knew that the load will miss to one of the outer > points-of-coherence). The difference being that the load would not > hang around in a scheduling queue until being dispatched, but its > execution would start immediately and take more cycles (and > potentially block an execution pipeline for longer). Absolutely true. I'm being imprecise in my language, increasing the "latency" of the store is really a proxy for "do something to encourage the load to move away from the store". But overall rewriting the sequence is probably the better choice. In my mind the scheduler approach would be a secondary attempt if we couldn't interchange the store/load. And I'd make a small bet that its impact would be on the margins if we're doing a reasonable job in the new pass. Jeff
On Tue, 11 Jun 2024, Jeff Law wrote: > > > On 6/11/24 7:52 AM, Philipp Tomsich wrote: > > On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreyalaw@gmail.com> wrote: > >> > >> > >> > >> On 6/11/24 1:22 AM, Richard Biener wrote: > >> > >>>> Absolutely. But forwarding from a smaller store to a wider load is > >>>> painful > >>>> from a hardware standpoint and if we can avoid it from a codegen > >>>> standpoint, > >>>> we should. > >>> > >>> Note there's also the possibility to increase the distance between the > >>> store and the load - in fact the time a store takes to a) retire and > >>> b) get from the store buffers to where the load-store unit would pick it > >>> up (L1-D) is another target specific tuning knob. That said, if that > >>> distance isn't too large (on x86 there might be only an upper bound > >>> given by the OOO window size and the L1D store latency(?), possibly > >>> also additionally by the store buffer size) attacking the issue in > >>> sched1 or sched2 might be another possibility. So I think pass placement > >>> is another thing to look at - I'd definitely place it after sched1 > >>> but I guess without looking at the pass again it's way before that? > >> True, but I doubt there are enough instructions we could sink the load > >> past to make a measurable difference. This is especially true on the > >> class of uarchs where this is going to be most important. > >> > >> In the case where the store/load can't be interchanged and thus this new > >> pass rejects any transformation, we could try to do something in the > >> scheduler to defer the load as long as possible. Essentially it's a > >> true dependency through a memory location using must-aliasing properties > >> and in that case we'd want to crank up the "latency" of the store so > >> that the load gets pushed away. > >> > >> I think one of the difficulties here is we often model stores as not > >> having any latency (which is probably OK in most cases). 
Input data > >> dependencies and structural hazards dominate considerations for > >> stores. > > > > I don't think that TARGET_SCHED_ADJUST_COST would even be called for a > > data-dependence through a memory location. > Probably correct, but we could adjust that behavior or add another mechanism > to adjust costs based on memory dependencies. > > > > > Note that, strictly speaking, the store does not have an extended > > latency; it will be the load that will have an increased latency > > (almost as if we knew that the load will miss to one of the outer > > points-of-coherence). The difference being that the load would not > > hang around in a scheduling queue until being dispatched, but its > > execution would start immediately and take more cycles (and > > potentially block an execution pipeline for longer). > Absolutely true. I'm being imprecise in my language, increasing the "latency" > of the store is really a proxy for "do something to encourage the load to move > away from the store". > > But overall rewriting the sequence is probably the better choice. In my mind > the scheduler approach would be a secondary attempt if we couldn't interchange > the store/load. And I'd make a small bet that its impact would be on the > margins if we're doing a reasonable job in the new pass. One of the points I wanted to make is that sched1 can make quite a difference as to the relative distance of the store and load and we have the instruction window the pass considers when scanning (possibly driven by target uarch details). So doing the rewriting before sched1 might not be ideal (but I don't know how much cleanup work the pass leaves behind - there's nothing between sched1 and RA). 
On the hardware side I always wondered whether a failed store-to-load forward results in the load uop stalling (because the hardware actually _did_ see the conflict with an in-flight store) or whether this gets caught later as the hardware speculates a load from L1 (with the wrong value) but has to roll back because of the conflict. I would imagine the latter is cheaper to implement but worse in case of conflict. Richard.
On Mon, Jun 10, 2024 at 9:27 PM Philipp Tomsich <philipp.tomsich@vrull.eu> wrote: > > On Mon, 10 Jun 2024 at 20:03, Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > > > > > On 6/10/24 1:55 AM, Manolis Tsamis wrote: > > > > >> > > > There was an older submission of a load-pair specific pass but this is > > > a complete reimplementation and indeed significantly more general. > > > Apart from being target independant, it addresses a number of > > > important restrictions and can handle multiple store forwardings per > > > load. > > > It should be noted that it cannot handle the load-pair cases as these > > > need special handling, but that's something we're planning to do in > > > the future by reusing this infrastructure. > > ACK. Thanks for the additional background. > > > > > > > > > >> > > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > >>> index 4e8967fd8ab..c769744d178 100644 > > >>> --- a/gcc/doc/invoke.texi > > >>> +++ b/gcc/doc/invoke.texi > > >>> @@ -12657,6 +12657,15 @@ loop unrolling. > > >>> This option is enabled by default at optimization levels @option{-O1}, > > >>> @option{-O2}, @option{-O3}, @option{-Os}. > > >>> > > >>> +@opindex favoid-store-forwarding > > >>> +@item -favoid-store-forwarding > > >>> +@itemx -fno-avoid-store-forwarding > > >>> +Many CPUs will stall for many cycles when a load partially depends on previous > > >>> +smaller stores. This pass tries to detect such cases and avoid the penalty by > > >>> +changing the order of the load and store and then fixing up the loaded value. > > >>> + > > >>> +Disabled by default. > > >> Is there any particular reason why this would be off by default at -O1 > > >> or higher? It would seem to me that on modern cores that this > > >> transformation should easily be a win. Even on an old in-order core, > > >> avoiding the load with the bit insert is likely profitable, just not as > > >> much so. 
> > >> > > > I don't have a strong opinion for that but I believe Richard's > > > suggestion to decide this on a per-target basis also makes a lot of > > > sense. > > > Deciding whether the transformation is profitable is tightly tied to > > > the architecture in question (i.e. how large the stall is and what > > > sort of bit-insert instructions are available). > > > In order to make this more widely applicable, I think we'll need a > > > target hook that decides in which case the forwarded stores incur a > > > penalty and thus the transformation makes sense. > > You and Richi are probably right. I'm not a big fan of passes being > > enabled/disabled on particular targets, but it may make sense here. > > > > > > > > > Afaik, for each CPU there may be cases that store forwarding is > > > handled efficiently. > > Absolutely. But forwarding from a smaller store to a wider load is > > painful from a hardware standpoint and if we can avoid it from a codegen > > standpoint, we should. > > This change is what I briefly hinted as "the complete solution" that > we had on the drawing board when we briefly talked last November in > Santa Clara. > > > Did y'all look at spec2017 at all for this patch? I've got our hardware > > guys to expose a signal for this case so that we can (in a month or so) > > get some hard data on how often it's happening in spec2017 and evaluate > > how this patch helps the most affected workloads. But if y'all already > > have some data we can use it as a starting point. > > We have looked at all of SPEC2017, especially for coverage (i.e., > making sure we see a significant number of uses of the transformation) > and correctness. The gcc_r and parest_r components triggered in a > number of "interesting" ways (e.g., motivating the case of > load-elimination). If it helps, we could share the statistics for how > often the pass triggers on compiling each of the SPEC2017 components? 
> Below is a table with 1) the number of successful store-forwarding-avoidance transformations and 2) the number of these where the load was also eliminated. This is SPEC2017 intrate and fprate; the omitted benchmarks in each case have zero transformations.

Following Richard's comment I did two runs: one with the pass placed just after cse1 (this patch) and one with it placed after sched1 (Richard's suggestion). I see an increased number of transformations after sched1 and also in some testcases we get improved code generation, so it looks promising. The original motivation was that placing this early would enable subsequent passes to optimize the transformed code, but it looks like this is not a major issue and the load/store placement is more important. I plan to follow up and provide some rough performance metrics and example assembly code of the transformation for these benchmarks.

I also tried increasing the maximum distance from 10 to 15 and saw only a small increase in transformation count. From what I've seen most cases that we care about are usually close.

The data (benchmark, # transformed, # load elimination):

After cse1:
500: 32 1
502: 89 9
510: 517 409
511: 12 1
520: 1 0
521: 4 1
525: 51 1
526: 46 11
557: 1 1

After sched1:
500: 45 13
502: 172 33
507: 23 23
510: 544 427
511: 41 29
520: 119 26
521: 19 11
523: 4 4
525: 129 44
526: 120 70
531: 1 1
538: 30 18
541: 2 2
557: 4 4

Note: ran on AArch64 with -O3 -mcpu=neoverse-n1 -flto=32.

Thanks,
Manolis

> Philipp.
On 6/12/24 12:47 AM, Richard Biener wrote:
> One of the points I wanted to make is that sched1 can make quite a
> difference as to the relative distance of the store and load and
> we have the instruction window the pass considers when scanning
> (possibly driven by target uarch details). So doing the rewriting
> before sched1 might not be ideal (but I don't know how much cleanup
> work the pass leaves behind - there's nothing between sched1 and RA).
ACK. I guess I'm just skeptical about how much separation we can get in practice from scheduling. As far as cleanup opportunity, it likely comes down to how clean the initial codegen is for the bitfield insertion step.

> On the hardware side I always wondered whether a failed store-to-load
> forward results in the load uop stalling (because the hardware actually
> _did_ see the conflict with an in-flight store) or whether this gets
> caught later as the hardware speculates a load from L1 (with the
> wrong value) but has to roll back because of the conflict. I would
> imagine the latter is cheaper to implement but worse in case of
> conflict.
I wouldn't be surprised to see both approaches being used and I suspect it really depends on how speculative your uarch is. At some point there's enough speculation going on that you can't detect the violation early enough and you have to implement a replay/rollback scheme.

jeff
On Sun, Jun 9, 2024 at 5:29 PM Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > On 6/7/24 4:31 PM, Jeff Law wrote: > > > > > I've actually added it to my tester just to see if there's any fallout. > > It'll take a week to churn through the long running targets that > > bootstrap in QEMU, but the crosses should have data Monday. > The first round naturally didn't trigger anything because the option is > off by default. So I twiddled it to be on at -O1 and above. > > epiphany-elf ICEs in gen_rtx_SUBREG with the attached .i file compiled > with -O2: > > > root@577c7458c93a://home/jlaw/jenkins/workspace/epiphany-elf/epiphany-elf-obj/newlib/epiphany-elf/newlib/libm/complex# epiphany-elf-gcc -O2 libm_a-cacos.i > > during RTL pass: avoid_store_forwarding > > ../../../..//newlib-cygwin/newlib/libm/complex/cacos.c: In function 'cacos': > > ../../../..//newlib-cygwin/newlib/libm/complex/cacos.c:99:1: internal compiler error: in gen_rtx_SUBREG, at emit-rtl.cc:1032 > > 0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>) > > ../../..//gcc/gcc/emit-rtl.cc:1032 > > 0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>) > > ../../..//gcc/gcc/emit-rtl.cc:1030 > > 0xe82216 process_forwardings > > ../../..//gcc/gcc/avoid-store-forwarding.cc:273 > > 0xe82216 avoid_store_forwarding > > ../../..//gcc/gcc/avoid-store-forwarding.cc:489 > > 0xe82667 execute > > ../../..//gcc/gcc/avoid-store-forwarding.cc:558 > Fixed in v3. Since we're dealing with arbitrary modes we need to call validate_subreg before doing gen_rtx_SUBREG. E.g. a SFMode to TIMode subreg would fail there. The code already handled not being able to use the subreg if recog failed so only an additional check was needed. I have added a new testcase function ssll_load_elim_2 that triggered this. 
> > ft32-elf ICE'd in bitmap_check_index at various optimization levels: > > > FAIL: execute/pr108498-2.c -O1 (internal compiler error: in bitmap_check_index, at sbitmap.h:104) > > FAIL: execute/pr108498-2.c -O1 (test for excess errors) > > FAIL: execute/pr108498-2.c -O2 (internal compiler error: in bitmap_check_index, at sbitmap.h:104) > > FAIL: execute/pr108498-2.c -O2 (test for excess errors) > > FAIL: execute/pr108498-2.c -O3 -g (internal compiler error: in bitmap_check_index, at sbitmap.h:104) > > FAIL: execute/pr108498-2.c -O3 -g (test for excess errors) > Fixed in v3 as well. The offending store had BLKMode and returned a store size of 0 which could overflow it->offset + store_size - 1 among others. A new check is added to reject any BLKMode set. I also added a check for sets that have a memory source and destination as these wouldn't be handled correctly. > > avr, c6x, > > lm32-elf failed to build libgcc with an ICE in leaf_function_p, I > haven't isolated that yet. > > > There were other failures as well. But you've got a few to start with > and we can retest pretty easily as the patch evolves. > The v3 also contains a fix for a memory corruption bug due to stores.safe_push (stores[move_to_front]) having a self reference and storing garbage when the push resulted in a resize. A testcase function ssll_load_elim_3 that triggered this has been added. Could you please run the v3 with your tester? I assume some of the additional fixes introduced may clear some of the other issues. Thanks, Manolis > jeff >
On 6/13/24 5:40 AM, Manolis Tsamis wrote: > > Could you please run the v3 with your tester? I assume some of the > additional fixes introduced may clear some of the other issues. Will do. jeff
diff --git a/gcc/Makefile.in b/gcc/Makefile.in index c983b0c102a..1fd68c7d182 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1683,6 +1683,7 @@ OBJS = \ statistics.o \ stmt.o \ stor-layout.o \ + avoid-store-forwarding.o \ store-motion.o \ streamer-hooks.o \ stringpool.o \ diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc new file mode 100644 index 00000000000..b641451a6b7 --- /dev/null +++ b/gcc/avoid-store-forwarding.cc @@ -0,0 +1,578 @@ +/* Avoid store forwarding optimization pass. + Copyright (C) 2024 Free Software Foundation, Inc. + Contributed by VRULL GmbH. + + This file is part of GCC. + + GCC is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + GCC is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received a copy of the GNU General Public License + along with GCC; see the file COPYING3. If not see + <http://www.gnu.org/licenses/>. */ + +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "rtl.h" +#include "alias.h" +#include "rtlanal.h" +#include "tree-pass.h" +#include "cselib.h" +#include "predict.h" +#include "insn-config.h" +#include "expmed.h" +#include "recog.h" +#include "regset.h" +#include "df.h" +#include "expr.h" +#include "memmodel.h" +#include "emit-rtl.h" +#include "vec.h" + +/* This pass tries to detect and avoid cases of store forwarding. + On many processors there is a large penalty when smaller stores are + forwarded to larger loads. The idea used to avoid the stall is to move + the store after the load and in addition emit a bit insert sequence so + the load register has the correct value. 
For example the following: + + strb w2, [x1, 1] + ldr x0, [x1] + + Will be transformed to: + + ldr x0, [x1] + and w2, w2, 255 + strb w2, [x1] + bfi x0, x2, 0, 8 +*/ + +namespace { + +const pass_data pass_data_avoid_store_forwarding = +{ + RTL_PASS, /* type. */ + "avoid_store_forwarding", /* name. */ + OPTGROUP_NONE, /* optinfo_flags. */ + TV_NONE, /* tv_id. */ + 0, /* properties_required. */ + 0, /* properties_provided. */ + 0, /* properties_destroyed. */ + 0, /* todo_flags_start. */ + TODO_df_finish /* todo_flags_finish. */ +}; + +class pass_rtl_avoid_store_forwarding : public rtl_opt_pass +{ +public: + pass_rtl_avoid_store_forwarding (gcc::context *ctxt) + : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt) + {} + + /* opt_pass methods: */ + virtual bool gate (function *) + { + return flag_avoid_store_forwarding && optimize >= 1 + && optimize_insn_for_speed_p (); + } + + virtual unsigned int execute (function *) override; +}; // class pass_rtl_avoid_store_forwarding + +typedef struct +{ + /* The store instruction that is a store forwarding candidate. */ + rtx_insn *store_insn; + /* SET_DEST (single_set (store_insn)). */ + rtx store_mem; + /* The temporary that will hold the stored value at the original store + position. */ + rtx mov_reg; + /* The instruction sequence that inserts the stored value's bits at the + appropriate position in the loaded value. */ + rtx_insn *bits_insert_insns; + /* The byte offset for the store's position within the load. 
*/ + HOST_WIDE_INT offset; + + unsigned int insn_cnt; + bool remove; + bool forwarded; +} store_info; + +static unsigned int stats_sf_detected = 0; +static unsigned int stats_sf_avoided = 0; + +/* Return the MEM rtx that EXPR reads from, looking through a ZERO_EXTEND + or SIGN_EXTEND of the source. EXPR must be the result of single_set and + may be NULL. Return NULL_RTX if EXPR does not load from memory. */ + +static rtx +get_load_mem (rtx expr) +{ + if (!expr) + return NULL_RTX; + + rtx mem = SET_SRC (expr); + + if (GET_CODE (mem) == ZERO_EXTEND + || GET_CODE (mem) == SIGN_EXTEND) + mem = XEXP (mem, 0); + + if (MEM_P (mem)) + return mem; + else + return NULL_RTX; +} + +/* Return true iff a store to STORE_MEM would write to a sub-region of bytes + from what LOAD_MEM would read. If true also store the relative byte offset + of the store within the load to OFF_VAL. */ + +static bool +is_store_forwarding (rtx store_mem, rtx load_mem, HOST_WIDE_INT *off_val) +{ + if (known_ge (GET_MODE_SIZE (GET_MODE (store_mem)), + GET_MODE_SIZE (GET_MODE (load_mem)))) + return false; + + rtx off = simplify_gen_binary (MINUS, GET_MODE (XEXP (store_mem, 0)), + XEXP (store_mem, 0), XEXP (load_mem, 0)); + + if (CONST_INT_P (off)) + { + *off_val = INTVAL (off); + poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (store_mem)); + poly_uint64 load_mode_size = GET_MODE_SIZE (GET_MODE (load_mem)); + unsigned HOST_WIDE_INT store_size, load_size; + if (store_mode_size.is_constant (&store_size) + && load_mode_size.is_constant (&load_size)) + { + return *off_val >= 0 + && (store_size + *off_val <= load_size); + } + } + + return false; +} + +/* Return a bit insertion sequence that would make DEST have the correct value + if the store represented by STORE_INFO were to be moved after DEST. 
*/ + +static rtx_insn * +generate_bit_insert_sequence (store_info *store_info, rtx dest, + machine_mode load_inner_mode) +{ + poly_uint64 store_mode_size + = GET_MODE_SIZE (GET_MODE (store_info->store_mem)); + poly_uint64 load_mode_size = GET_MODE_SIZE (load_inner_mode); + unsigned HOST_WIDE_INT store_size = store_mode_size.to_constant (); + unsigned HOST_WIDE_INT load_size = load_mode_size.to_constant (); + HOST_WIDE_INT bf_offset_bytes; + + if (BYTES_BIG_ENDIAN) + bf_offset_bytes = load_size - store_size - store_info->offset; + else + bf_offset_bytes = store_info->offset; + + start_sequence (); + store_bit_field (dest, store_size * BITS_PER_UNIT, + bf_offset_bytes * BITS_PER_UNIT, 0, 0, + GET_MODE (dest), store_info->mov_reg, + false, false); + + rtx_insn *insns = get_insns (); + + for (rtx_insn *insn = insns; insn; insn = NEXT_INSN (insn)) + if (contains_mem_rtx_p (PATTERN (insn))) + { + /* Close the open sequence before bailing out. */ + end_sequence (); + return NULL; + } + + end_sequence (); + + return insns; +} + +/* Given a list of small stores that are forwarded to LOAD_INSN, try to + rearrange them so that a store-forwarding penalty doesn't occur. */ + +static bool +process_forwardings (vec<store_info> &stores, rtx_insn *load_insn) +{ + rtx load = single_set (load_insn); + machine_mode load_inner_mode = GET_MODE (get_load_mem (load)); + poly_uint64 load_mode_size = GET_MODE_SIZE (load_inner_mode); + HOST_WIDE_INT load_size = load_mode_size.to_constant (); + + /* If the stores cover all the bytes of the load without overlap then we can + eliminate the load entirely and use the computed value instead. 
*/ + + sbitmap forwarded_bytes = sbitmap_alloc (load_size); + + unsigned int i; + store_info* it; + FOR_EACH_VEC_ELT (stores, i, it) + { + poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (it->store_mem)); + HOST_WIDE_INT store_size = store_mode_size.to_constant (); + if (bitmap_bit_in_range_p (forwarded_bytes, it->offset, + it->offset + store_size - 1)) + break; + bitmap_set_range (forwarded_bytes, it->offset, store_size); + } + + bitmap_not (forwarded_bytes, forwarded_bytes); + bool eliminate_load = bitmap_empty_p (forwarded_bytes); + + stats_sf_detected++; + + if (dump_file) + { + fprintf (dump_file, "Store forwarding%s detected:\n", + (stores.length () > 1) ? "s" : ""); + + FOR_EACH_VEC_ELT (stores, i, it) + { + fprintf (dump_file, "From: "); + print_rtl_single (dump_file, it->store_insn); + } + + fprintf (dump_file, "To: "); + print_rtl_single (dump_file, load_insn); + + if (eliminate_load) + fprintf (dump_file, "(Load elimination candidate)\n"); + } + + rtx dest; + if (eliminate_load) + dest = gen_reg_rtx (load_inner_mode); + else + dest = SET_DEST (load); + + int move_to_front = -1; + int total_cost = 0; + + /* Check if we can emit bit insert instructions for all forwarded stores. */ + FOR_EACH_VEC_ELT (stores, i, it) + { + it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem)); + rtx_insn *insns = NULL; + + /* If we're eliminating the load then find the store with zero offset + and use it as the base register to avoid a bit insert. */ + if (eliminate_load && it->offset == 0) + { + start_sequence (); + + /* We can use a paradoxical subreg to force this to a wider mode, as + the only use will be inserting the bits (i.e., we don't care about + the value of the higher bits). 
*/ + rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0); + rtx_insn *move0 = emit_move_insn (dest, ext0); + if (recog_memoized (move0) >= 0) + { + insns = get_insns (); + move_to_front = (int) i; + } + + end_sequence (); + } + + if (!insns) + insns = generate_bit_insert_sequence (&(*it), dest, load_inner_mode); + + if (!insns) + { + if (dump_file) + { + fprintf (dump_file, "Failed due to: "); + print_rtl_single (dump_file, it->store_insn); + } + return false; + } + + total_cost += seq_cost (insns, true); + it->bits_insert_insns = insns; + } + + if (eliminate_load) + total_cost -= COSTS_N_INSNS (1); + + /* param_store_forwarding_max_distance should be somewhat correlated to the + store forwarding penalty; if the penalty is large then it is justified to + increase the window size. As such we can use it to reject sequences that + are clearly unprofitable. */ + int max_cost = COSTS_N_INSNS (param_store_forwarding_max_distance / 2); + if (total_cost > max_cost) + { + if (dump_file) + fprintf (dump_file, "Not transformed due to sequence cost: %d > %d.\n", + total_cost, max_cost); + + return false; + } + + /* If we have a move instead of bit insert, it needs to be emitted first in + the resulting sequence. */ + if (move_to_front != -1) + { + stores.safe_push (stores[move_to_front]); + stores.ordered_remove (move_to_front); + } + + if (dump_file) + { + fprintf (dump_file, "Store forwarding%s avoided with bit inserts:\n", + (stores.length () > 1) ? 
"s" : ""); + + FOR_EACH_VEC_ELT (stores, i, it) + { + if (stores.length () > 1) + { + fprintf (dump_file, "For: "); + print_rtl_single (dump_file, it->store_insn); + } + + fprintf (dump_file, "With sequence:\n"); + + for (rtx_insn *insn = it->bits_insert_insns; insn; + insn = NEXT_INSN (insn)) + { + fprintf (dump_file, " "); + print_rtl_single (dump_file, insn); + } + } + } + + stats_sf_avoided++; + + if (eliminate_load) + { + machine_mode outter_mode = GET_MODE (SET_DEST (load)); + rtx_code extend = ZERO_EXTEND; + if (outter_mode != load_inner_mode) + extend = GET_CODE (SET_SRC (load)); + + rtx load_value = simplify_gen_unary (extend, outter_mode, dest, + load_inner_mode); + rtx load_move = gen_move_insn (SET_DEST (load), load_value); + df_insn_rescan (emit_insn_after (load_move, load_insn)); + } + + FOR_EACH_VEC_ELT (stores, i, it) + { + /* Emit code that updated the loaded value to account for the + missing store. */ + df_insn_rescan (emit_insn_after (it->bits_insert_insns, load_insn)); + } + + FOR_EACH_VEC_ELT (stores, i, it) + { + rtx store_set = single_set (it->store_insn); + /* Create a register move at the store's original position to save the + stored value. */ + rtx mov1 = gen_move_insn (it->mov_reg, SET_SRC (store_set)); + df_insn_rescan (emit_insn_before (mov1, it->store_insn)); + /* Create a new store after the load with the saved original value. + This avoids the forwarding stall. */ + rtx mov2 = gen_move_insn (SET_DEST (store_set), it->mov_reg); + df_insn_rescan (emit_insn_after (mov2, load_insn)); + /* Done, delete the original store. */ + set_insn_deleted (it->store_insn); + } + + df_insn_rescan (load_insn); + + if (eliminate_load) + set_insn_deleted (load_insn); + + return true; +} + +/* Process BB for expensive store forwardings. 
*/ + +static void +avoid_store_forwarding (basic_block bb) +{ + auto_vec<store_info, 8> store_exprs; + rtx_insn *insn; + unsigned int insn_cnt = 0; + + FOR_BB_INSNS (bb, insn) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + rtx set = single_set (insn); + + /* Store forwarding issues are unlikely if we cross a call. + Clear store forwarding candidates if we can't understand INSN. */ + if (CALL_P (insn) || !set || volatile_refs_p (set)) + { + store_exprs.truncate (0); + continue; + } + + rtx load_mem = get_load_mem (set); + int removed_count = 0; + + if (MEM_P (SET_DEST (set))) + { + /* Record store forwarding candidate. */ + store_info info; + info.store_insn = insn; + info.store_mem = SET_DEST (set); + info.insn_cnt = insn_cnt; + info.remove = false; + info.forwarded = false; + store_exprs.safe_push (info); + } + else if (load_mem) + { + /* Process load for possible store forwardings. */ + auto_vec<store_info> forwardings; + bool partial_forwarding = false; + bool remove_rest = false; + + unsigned int i; + store_info *it; + FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it) + { + rtx store_mem = it->store_mem; + HOST_WIDE_INT off_val; + + if (remove_rest) + { + it->remove = true; + removed_count++; + } + else if (is_store_forwarding (store_mem, load_mem, &off_val)) + { + /* Check if moving this store after the load is legal. */ + bool write_dep = false; + for (unsigned int j = store_exprs.length () - 1; j != i; j--) + if (!store_exprs[j].forwarded + && output_dependence (store_mem, + store_exprs[j].store_mem)) + { + write_dep = true; + break; + } + + if (!write_dep) + { + it->forwarded = true; + it->offset = off_val; + forwardings.safe_push (*it); + } + else + partial_forwarding = true; + + it->remove = true; + removed_count++; + } + else if (true_dependence (store_mem, GET_MODE (store_mem), + load_mem)) + { + /* We cannot keep a store forwarding candidate if it possibly + interferes with this load. 
*/ + it->remove = true; + removed_count++; + remove_rest = true; + } + } + + if (!forwardings.is_empty () && !partial_forwarding) + process_forwardings (forwardings, insn); + } + else + { + rtx reg = SET_DEST (set); + + while (GET_CODE (reg) == ZERO_EXTRACT + || GET_CODE (reg) == STRICT_LOW_PART + || GET_CODE (reg) == SUBREG) + reg = XEXP (reg, 0); + + /* Drop store forwarding candidates when the address register is + overwritten. */ + if (REG_P (reg)) + { + bool remove_rest = false; + unsigned int i; + store_info *it; + FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it) + { + if (remove_rest + || reg_overlap_mentioned_p (reg, it->store_mem)) + { + it->remove = true; + removed_count++; + remove_rest = true; + } + } + } + else + { + /* We can't understand INSN. */ + store_exprs.truncate (0); + continue; + } + } + + if (removed_count) + { + unsigned int i, j; + store_info *it; + VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove); + } + + /* Don't consider store forwarding if the RTL instruction distance is + more than PARAM_STORE_FORWARDING_MAX_DISTANCE. */ + if (!store_exprs.is_empty () + && (store_exprs[0].insn_cnt + + param_store_forwarding_max_distance <= insn_cnt)) + store_exprs.ordered_remove (0); + + insn_cnt++; + } +} + +unsigned int +pass_rtl_avoid_store_forwarding::execute (function *fn) +{ + df_set_flags (DF_DEFER_INSN_RESCAN); + df_note_add_problem (); + + init_alias_analysis (); + cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS); + + stats_sf_detected = 0; + stats_sf_avoided = 0; + + basic_block bb; + FOR_EACH_BB_FN (bb, fn) + avoid_store_forwarding (bb); + + end_alias_analysis (); + cselib_finish (); + df_analyze (); + + statistics_counter_event (fn, "Store forwardings detected: ", + stats_sf_detected); + statistics_counter_event (fn, "Store forwardings avoided: ", + stats_sf_avoided); + + return 0; +} + +} // anon namespace. 
+ +rtl_opt_pass * +make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt) +{ + return new pass_rtl_avoid_store_forwarding (ctxt); +} diff --git a/gcc/common.opt b/gcc/common.opt index 2c078fdd1f8..2fcf7170c2a 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -1747,6 +1747,10 @@ fgcse-sm Common Var(flag_gcse_sm) Init(0) Optimization Perform store motion after global common subexpression elimination. +favoid-store-forwarding +Common Var(flag_avoid_store_forwarding) Init(0) Optimization +Try to avoid store forwarding. + fgcse-las Common Var(flag_gcse_las) Init(0) Optimization Perform redundant load after store elimination in global common subexpression diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 4e8967fd8ab..c769744d178 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -12657,6 +12657,15 @@ loop unrolling. This option is enabled by default at optimization levels @option{-O1}, @option{-O2}, @option{-O3}, @option{-Os}. +@opindex favoid-store-forwarding +@item -favoid-store-forwarding +@itemx -fno-avoid-store-forwarding +Many CPUs will stall for many cycles when a load partially depends on previous +smaller stores. This pass tries to detect such cases and avoid the penalty by +changing the order of the load and store and then fixing up the loaded value. + +Disabled by default. + @opindex ffp-contract @item -ffp-contract=@var{style} @option{-ffp-contract=off} disables floating-point expression contraction. diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi index b50d3d5635b..fca1d437331 100644 --- a/gcc/doc/passes.texi +++ b/gcc/doc/passes.texi @@ -977,6 +977,14 @@ and addressing mode selection. The pass is run twice, with values being propagated into loops only on the second run. The code is located in @file{fwprop.cc}. +@item Store forwarding avoidance + +This pass attempts to reduce the overhead of store to load forwarding. 
+It detects when a load reads from one or more previous smaller stores and
+then rearranges them so that the stores are done after the load.  The loaded
+value is adjusted with a series of bit insert instructions so that it stays
+the same.  The code is located in @file{avoid-store-forwarding.cc}.
+
 @item Common subexpression elimination
 
 This pass removes redundant computation within basic blocks, and
diff --git a/gcc/params.opt b/gcc/params.opt
index d34ef545bf0..b8115f5c27a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
 
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
+Maximum instruction distance between a small store and the larger load that it forwards to.
+
 -param=switch-conversion-max-branch-ratio=
 Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
 The maximum ratio between array size and switch branches for a switch conversion to take place.
diff --git a/gcc/passes.def b/gcc/passes.def
index 1cbbd413097..1e608774707 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -462,6 +462,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_lower_subreg);
       NEXT_PASS (pass_df_initialize_opt);
       NEXT_PASS (pass_cse);
+      NEXT_PASS (pass_rtl_avoid_store_forwarding);
       NEXT_PASS (pass_rtl_fwprop);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_rtl_pre);
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
new file mode 100644
index 00000000000..4712f101d5e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+  char arr_8[8];
+  long long_value;
+} DataUnion;
+
+long ssll_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value;
+}
+
+long ssll_2 (DataUnion *data, char x)
+{
+  data->arr_8[1] = x;
+  return data->long_value;
+}
+
+long ssll_3 (DataUnion *data, char x)
+{
+  data->arr_8[7] = x;
+  return data->long_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
new file mode 100644
index 00000000000..b958612173b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+  char arr_8[8];
+  int long_value;
+} DataUnion1;
+
+long no_ssll_1 (DataUnion1 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value;
+}
+
+long no_ssll_2 (DataUnion1 *data, char x)
+{
+  data->arr_8[5] = x;
+  return data->long_value;
+}
+
+typedef union {
+  char arr_8[8];
+  short long_value[4];
+} DataUnion2;
+
+long no_ssll_3 (DataUnion2 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value[1];
+}
+
+long no_ssll_4 (DataUnion2 *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value[1];
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
new file mode 100644
index 00000000000..f4814c1a4d2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+  char arr_8[8];
+  long long_value;
+} DataUnion;
+
+long ssll_multi_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[2] = x;
+  return data->long_value;
+}
+
+long ssll_multi_2 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[1] = 11;
+  return data->long_value;
+}
+
+long ssll_multi_3 (DataUnion *data, char x, short y)
+{
+  data->arr_8[1] = x;
+  __builtin_memcpy(data->arr_8 + 4, &y, sizeof(short));
+  return data->long_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwardings avoided" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
new file mode 100644
index 00000000000..453d0e8974e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+typedef union {
+  char arr_16[16];
+  v4si vec_value;
+} DataUnion;
+
+v4si ssll_vect_1 (DataUnion *data, char x)
+{
+  data->arr_16[0] = x;
+  return data->vec_value;
+}
+
+v4si ssll_vect_2 (DataUnion *data, int x)
+{
+  __builtin_memcpy(data->arr_16 + 4, &x, sizeof(int));
+  return data->vec_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 2 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 2 "avoid_store_forwarding" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 29267589eeb..49957ba3373 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -570,6 +570,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example it can transform this:

     strb    w2, [x1, 1]
     ldr     x0, [x1]    # Expensive store forwarding to larger load.

To:

     ldr     x0, [x1]
     strb    w2, [x1, 1]
     bfi     x0, x2, 8, 8

Assembly like this can appear with bitfields or type punning / unions.
On stress-ng, when running the cpu-union microbenchmark the following
speedups have been observed:

     Neoverse-N1:      +29.4%
     Intel Coffeelake: +13.1%
     AMD 5950X:        +17.5%

gcc/ChangeLog:

	* Makefile.in: Add avoid-store-forwarding.o.
	* common.opt: New option -favoid-store-forwarding.
	* params.opt: New param store-forwarding-max-distance.
	* doc/invoke.texi: Document new pass.
	* doc/passes.texi: Document new pass.
	* passes.def: Schedule a new pass.
	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
	* avoid-store-forwarding.cc: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.

Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
---

Changes in v2:
	- Allow modes that are not scalar_int_mode.
	- Introduce simple costing to avoid unprofitable transformations.
	- Reject bit insert sequences that spill to memory.
	- Document new pass.
	- Fix and add testcases.
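For reference, the fixup step the transformed sequence performs can be modeled
in plain C.  This is a hand-written sketch of the semantics (`bit_insert` is a
hypothetical helper, not code from the patch): after the load is hoisted above
the store, the stored byte is patched into the loaded value at its byte offset,
mirroring what the bfi instruction does on a little-endian target:

```c
#include <stdint.h>

/* Insert the 8-bit STORED value into LOADED at byte offset BYTE_OFF,
   roughly equivalent to "bfi x0, x2, #(8*BYTE_OFF), #8".  */
uint64_t
bit_insert (uint64_t loaded, uint8_t stored, unsigned byte_off)
{
  uint64_t mask = (uint64_t) 0xff << (8 * byte_off);
  return (loaded & ~mask) | ((uint64_t) stored << (8 * byte_off));
}
```

A multi-store forwarding (as in the ssll_multi testcases) would simply chain
one such insertion per forwarded store.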
 gcc/Makefile.in                               |   1 +
 gcc/avoid-store-forwarding.cc                 | 578 ++++++++++++++++++
 gcc/common.opt                                |   4 +
 gcc/doc/invoke.texi                           |   9 +
 gcc/doc/passes.texi                           |   8 +
 gcc/params.opt                                |   4 +
 gcc/passes.def                                |   1 +
 .../aarch64/avoid-store-forwarding-1.c        |  28 +
 .../aarch64/avoid-store-forwarding-2.c        |  39 ++
 .../aarch64/avoid-store-forwarding-3.c        |  31 +
 .../aarch64/avoid-store-forwarding-4.c        |  24 +
 gcc/tree-pass.h                               |   1 +
 12 files changed, 728 insertions(+)
 create mode 100644 gcc/avoid-store-forwarding.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
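As a self-contained illustration of the source pattern the pass targets, the
following union example (hypothetical, modeled on the DataUnion testcases
above) has a narrow store immediately followed by a wider overlapping load,
which is exactly the store-forwarding stall the pass tries to sidestep:

```c
/* A 1-byte store followed by an overlapping 8-byte load: without
   -favoid-store-forwarding the CPU must forward the narrow store
   into the wide load, which stalls on many cores.  */
union pun {
  char bytes[8];
  long long wide;
};

long long
store_then_load (union pun *u, char x)
{
  u->bytes[1] = x;   /* narrow store */
  return u->wide;    /* wider load overlapping the store */
}
```

With the patch applied, compiling this at -O2 with -favoid-store-forwarding
should hoist the load above the store and patch byte 1 of the result, much
like the bfi sequence shown in the commit message.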