Message ID: 1447431522-4695-1-git-send-email-alan.lawrence@arm.com
State: New
On 13 November 2015 at 17:18, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 10/11/15 12:51, Richard Biener wrote:
>>> Just noticing this... if we have a vectorization factor of 4 and matches
>>> is 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 then this will split into
>>> 1, 1, 1, 1 and 1, 1, 0, 0, 0, ... where we know from the matches that it
>>> will again fail?
>>>
>>> Thus shouldn't we split either only if i % vectorization_factor is 0 or,
>>> if not, split "twice", dropping the intermediate surely non-matching
>>> group of vectorization_factor size?  After all, if we split like with the
>>> patch then the upper half will _not_ be split again with the
>>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>>
>>> So I expect that the statistics will be the same if we restrict splitting
>>> to the i % vectorization_factor == 0 case, or rather split where we do
>>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>>
>>> Ok with that change.  Improvements on that incrementally.
>>
>> Btw, it just occurs to me that the whole thing is equivalent to splitting
>> the store-group into vector-size pieces up-front?  That way we do
>> the maximum splitting up-front and avoid any redundant work?
>>
>> The patch is still ok as said, just the above may be a simple thing
>> to explore.
>
> I'd refrained from splitting in vect_analyze_group_access_1, as my
> understanding was that we only did that once, whereas we would retry the
> vect_analyze_slp_instance path each time we decreased the
> vectorization_factor... however, I did try putting code at the beginning of
> vect_analyze_slp_instance to split up any groups > vf.  Unfortunately this
> loses us some previously-successful SLPs, as some bigger groups cannot be
> SLPed if we split them, as they require 'unrolling'... so not addressing
> that here.
>
> However your suggestion of splitting twice when we know the boundary is in
> the middle of a vector is a nice compromise; it nets us a good number more
> successes in SPEC2000 and SPEC2006, about 7% more than without the patch.
>
> Hence, here's the patch I've committed, as r230330, after regstrap on
> x86_64 and AArch64.  (I dropped the previous bb-slp-subgroups-2 and renamed
> the others up as we don't do that one anymore.)
>
> Cheers, Alan
>
> gcc/ChangeLog:
>
>     PR tree-optimization/67682
>     * tree-vect-slp.c (vect_split_slp_store_group): New.
>     (vect_analyze_slp_instance): During basic block SLP, recurse on
>     subgroups if vect_build_slp_tree fails after 1st vector.
>
> gcc/testsuite/ChangeLog:
>
>     PR tree-optimization/67682
>     * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>     * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>     * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>     * gcc.dg/vect/bb-slp-subgroups-3.c: New.

Hi Alan,

I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c) fails for
armeb targets.  I haven't had time to look at more details yet, but I guess
you can reproduce it quickly enough.
> [patch body snipped; the committed diff appears in full below]
On 16/11/15 14:42, Christophe Lyon wrote:
>
> Hi Alan,
>
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.

Thanks - yes I see it now.

-fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  <bb 2>:
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}

but while a[0...3] end up containing 5 7 9 11 as expected, a[4..7] end up
with 30 32 30 28 rather than the expected 12 24 40 70.  That is, we end up
with (10 8 6 4), rather than the expected (4 6 8 10), being multiplied by
{3,4,5,7}.  Looking at the RTL, those values come from a UZP1/2 pair that
should extract elements {0,2,4,6} of b.  Assembler, with my workings as to
what's in each register:

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw    r2, #:lower16:b
        movt    r2, #:upper16:b
        vldr    d22, .L11
        vldr    d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
        movw    r3, #:lower16:a
        movt    r3, #:upper16:a
        vld1.64 {d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
        vmov    q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
        vldr    d20, [r2, #16]
        vldr    d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
        vuzp.32 q10, q9
;; So q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
        vldr    d20, .L11+16
        vldr    d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
        vmul.i32        q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
        vadd.i32        q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
        vst1.64 {d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3] = b[3]+4, all ok
        vstr    d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
        vstr    d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
        bx      lr
.L12:
        .align  3
.L11:
        .word   3
        .word   4
        .word   5
        .word   7
        .word   1
        .word   2
        .word   3
        .word   4

Which is to say - the bit order in the q-registers is neither big- nor
little-endian, but the elements get stored back to memory in an order
consistent with how they were loaded, so we're OK as long as there are no
permutes.  Unfortunately for UZP this lane-ordering mixup is not idempotent
and messes everything up...

Hmmm.  I'm feeling that "properly" fixing this testcase amounts to fixing
armeb's whole register-numbering/lane-flipping scheme, and might be quite a
large task.  OTOH it might also fix the significant number of failing
vectorizer tests.  A simpler solution might be to disable... some part of
vector support... on armeb, but I'm not sure which part would be best, yet.

Thoughts (CC maintainers)?

--Alan
On Tue, Nov 17, 2015 at 5:46 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 16/11/15 14:42, Christophe Lyon wrote:
>> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
>> fails for armeb targets.
>
> [full analysis quoted above snipped]
>
> Thoughts (CC maintainers)?

As most of the permutation support in the arm backend is disabled for armeb,
I suppose disabling some more of it is appropriate ... Or finally sitting
down and fixing the mess.  For constant permutes, fixing it on the constant
side would be an option I guess.

Richard.

> --Alan
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+  a[0] = b[0];
+  a[1] = b[1];
+  a[2] = b[2];
+  a[3] = b[3];
+  a[4] = 0;
+  a[5] = 0;
+  a[6] = 0;
+  a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+  a[0] = b[2] + 1;
+  a[1] = b[0] + 2;
+  a[2] = b[1] + 3;
+  a[3] = b[1] + 4;
+  a[4] = b[3] * 3;
+  a[5] = b[0] * 4;
+  a[6] = b[2] * 5;
+  a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+  a[0] = b[0] + 1;
+  a[1] = b[1] + 2;
+  a[2] = b[2] + 3;
+  a[3] = b[3] + 4;
+  a[4] = b[0] * 3;
+  a[5] = b[2] * 4;
+  a[6] = b[4] * 5;
+  a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..65a183f 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  unsigned group1_size = i & ~(vectorization_factor - 1);
+
+	  gimple *rest = vect_split_slp_store_group (stmt, group1_size);
+	  bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
+	  /* If the first non-match was in the middle of a vector,
+	     skip the rest of that vector.  */
+	  if (group1_size < i)
+	    {
+	      i = group1_size + vectorization_factor;
+	      if (i < group_size)
+		rest = vect_split_slp_store_group (rest, vectorization_factor);
+	    }
+	  if (i < group_size)
+	    res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
+	  return res;
+	}
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }