Message ID | patch-18818-tamar@arm.com |
---|---|
State | New |
Headers | show |
Series | middle-end: support SLP early break | expand |
On Tue, 1 Oct 2024, Tamar Christina wrote: > Hi all, > > This patch introduces feature parity for early break int the SLP only > vectorizer. > > The approach taken here is to treat the early exits as root statements for an > SLP tree. This means that we don't need any changes to build_slp to support > gconds. > > Codegen for the gcond itself now has to be done out of line but the body of the > SLP blocks itself is simply driven by SLP scheduling. There is a slight > awkwardness in having re-used vectorizable_early_exit for both SLP and non-SLP > but I've documented the differences and when I did try to refactor it it wasn't > really worth it given that this is a temporary state anyway. > > This version is restricted to lane = 1, as such we can re-use the existing > move_early_break function instead of having to do safety update through > scheduling. I have a branch where I'm working on that but lane > 1 is out of > scope for GCC 15 anyway. The only reason I will try to get moving through > scheduling done as a stretch goal is so we get epilogue vectorization back for > early break. > > The example: > > unsigned test4(unsigned x) > { > unsigned ret = 0; > for (int i = 0; i < N; i++) > { > vect_b[i] = x + i; > if (vect_a[i]*2 != x) > break; > vect_a[i] = x; > > } > return ret; > } > > builds the following SLP instance for early break: > > note: Analyzing vectorizable control flow: if (patt_6 != 0) > note: Starting SLP discovery for > note: patt_6 = _4 != x_9(D); > note: starting SLP discovery for node 0x63abc80 > note: Build SLP for patt_6 = _4 != x_9(D); > note: precomputed vectype: vector(4) <signed-boolean:32> > note: nunits = 4 > note: vect_is_simple_use: operand x_9(D), type of def: external > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xffff > _3 * 2, type of def: internal > note: starting SLP discovery for node 0x63abdc0 > note: Build SLP for _4 = _3 * 2; > note: precomputed vectype: vector(4) unsigned int > note: nunits = 4 > note: vect_is_simple_use: operand # > vect_aD.4416[i_15], type of def: internal > note: vect_is_simple_use: operand 2, type of def: constant > note: starting SLP discovery for node 0x63abe60 > note: Build SLP for _3 = vect_a[i_15]; > note: precomputed vectype: vector(4) unsigned int > note: nunits = 4 > note: SLP discovery for node 0x63abe60 succeeded > note: SLP discovery for node 0x63abdc0 succeeded > note: SLP discovery for node 0x63abc80 succeeded > note: SLP size 3 vs. limit 10. > note: Final SLP tree for instance 0x6474190: > note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32> > note: op template: patt_6 = _4 != x_9(D); > note: stmt 0 patt_6 = _4 != x_9(D); > note: children 0x63abd20 0x63abdc0 > note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) > note: { x_9(D) } > note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int > note: op template: _4 = _3 * 2; > note: stmt 0 _4 = _3 * 2; > note: children 0x63abe60 0x63abf00 > note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int > note: op template: _3 = vect_a[i_15]; > note: stmt 0 _3 = vect_a[i_15]; > note: load permutation { 0 } > note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) > note: { 2 } > > and during codegen: > > note: ------>vectorizing SLP node starting from: patt_6 = _4 != x_9(D); > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xffff > _3 * 2, type of def: internal > note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; > note: === vectorizable_early_exit === > note: transform early-exit. > note: vectorizing stmts using SLP. > note: Vectorizing SLP tree: > note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int > note: op template: i_12 = i_15 + 1; > note: stmt 0 i_12 = i_15 + 1; > note: children 0x63aba00 0x63ac040 > note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int > note: op template: i_15 = PHI <i_12(6), 0(14)> > note: [l] stmt 0 i_15 = PHI <i_12(6), 0(14)> > note: children (nil) (nil) > note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int > note: { 1 } > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf, > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > Also bootstrapped --with-build-config='bootstrap-O3 bootstrap-lto' > --enable-checking=release,yes,rtl,extra on aarch64-none-linux-gnu and > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > Ok for master? > > gcc/ChangeLog: > > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > (LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS): New. > (vectorizable_early_exit): Expose. > (class _loop_vec_info): Add early_break_live_stmts. > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > Support gcond instances. > (vect_analyze_slp): Analyze gcond roots and early break live statements. > (maybe_push_to_hybrid_worklist): Don't sink gconds. > (vect_slp_analyze_node_operations): Support gconds. > (vect_slp_check_for_roots): Update comments. > (vectorize_slp_instance_root_stmt): Support gconds. > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > statements. > (vectorizable_early_exit): Support SLP. > > gcc/testsuite/ChangeLog: > > * gcc.dg/vect/vect-early-break_126.c: New test. > * gcc.dg/vect/vect-early-break_127.c: New test. > * gcc.dg/vect/vect-early-break_128.c: New test. > > --- > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > new file mode 100644 > index 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123ff509d13be17ffacf > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > @@ -0,0 +1,28 @@ > +/* { dg-do compile } */ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break } */ > +/* { dg-require-effective-target vect_int } */ > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > + > +#define N 1024 > +unsigned vect_a[N]; > +unsigned vect_b[N]; > + > +unsigned test4(unsigned x) > +{ > + unsigned ret = 0; > + for (int i = 0; i < N; i++) > + { > + vect_b[i] = x + i; > + if (vect_a[i] > x) > + { > + ret *= vect_a[i]; > + return vect_a[i]; > + } > + vect_a[i] = x; > + ret += vect_a[i] + vect_b[i]; > + } > + return ret; > +} > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > new file mode 100644 > index 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72c35df8e83535ef184ab > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > @@ -0,0 +1,27 @@ > +/* { dg-do compile } */ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break } */ > +/* { dg-require-effective-target vect_int } */ > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > + > +#ifndef N > +#define N 800 > +#endif > +unsigned vect_a[N]; > +unsigned vect_b[N]; > + > +unsigned test4(unsigned x) > +{ > + unsigned ret = 0; > + for (int i = 0; i < N; i++) > + { > + vect_b[i] = x + i; > + if (vect_a[i]*2 != x) > + break; > + vect_a[i] = x; > + > + } > + return ret; > +} > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > new file mode 100644 > index 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1de2c4a04286989204fd > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > @@ -0,0 +1,31 @@ > +/* { dg-do compile } */ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break } */ > +/* { dg-require-effective-target vect_int } */ > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > + > +#ifndef N > +#define N 800 > +#endif > +unsigned vect_a[N]; > +unsigned vect_b[N]; > + > +unsigned test4(unsigned x) > +{ > + unsigned ret = 0; > + for (int i = 0; i < N; i+=2) > + { > + vect_b[i] = x + i; > + vect_b[i+1] = x + i+1; > + if (vect_a[i]*2 != x) > + break; > + if (vect_a[i+1]*2 != x) > + break; > + vect_a[i] = x; > + vect_a[i+1] = x; > + > + } > + return ret; > +} > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 600987dd6e5d506aa5fbb02350f9dab77793d382..7e765df466a59249feb999c24d8f2dad232948ae 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -3697,6 +3697,13 @@ vect_build_slp_instance (vec_info *vinfo, > "Analyzing vectorizable constructor: %G\n", > root_stmt_infos[0]->stmt); > } > + else if (kind == slp_inst_kind_gcond) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Analyzing vectorizable control flow: %G", > + root_stmt_infos[0]->stmt); > + } > > if (dump_enabled_p ()) > { > @@ -4143,6 +4150,12 @@ vect_analyze_slp_instance (vec_info *vinfo, > STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) > = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ())); > } > + else if (kind == slp_inst_kind_gcond) > + { > + /* Collect the stores and store them in scalar_stmts. */ > + scalar_stmts.create (1); > + scalar_stmts.quick_push (vect_stmt_to_vectorize (next_info)); > + } We have this "left-over" dual API but since you never call vect_analyze_slp_instance with slp_inst_kind_gcond but use vect_build_slp_instance I think you can drop this hunk. > else > gcc_unreachable (); > > @@ -4742,6 +4755,56 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size, > bst_map, NULL, force_single_lane); > } > } > + > + /* Find SLP sequences starting from gconds. */ > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > + { > + auto cond_info = loop_vinfo->lookup_stmt (cond); > + vec<stmt_vec_info> stmts; > + vec<stmt_vec_info> roots = vNULL; > + vec<tree> remain = vNULL; > + > + cond_info = vect_stmt_to_vectorize (cond_info); > + roots.safe_push (cond_info); > + stmts.create (2); > + tree args0 = gimple_cond_lhs (STMT_VINFO_STMT (cond_info)); > + tree args1 = gimple_cond_rhs (STMT_VINFO_STMT (cond_info)); > + /* An argument without a loop def will be codegened from vectorizing the > + root gcond itself. As such we don't need to try to build an SLP tree > + from them. It's highly likely that the resulting SLP tree here if both > + arguments have a def will be incompatible, but we rely on it being split > + later on. */ > + if (auto varg = loop_vinfo->lookup_def (args0)) > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > + > + if (auto varg = loop_vinfo->lookup_def (args1)) > + stmts.quick_push (vect_stmt_to_vectorize (varg)); Hmm - you pattern-replace if (a != b) with patt = a != b; if (patt != 0) correct? So I don't get why you push two scalar stmts in 'stmts' and not just the pattern def of args0? Plus assert the condition is NE_EXPR and the cond_rhs zero? (or simply guard discovery with that - other gconds will simply not be vectorized) That said, when loop_vinfo->lookup_def (args1) is not NULL what you ask for doesn't really make sense - even if it discovers OK (you get vectors with the condition lhs and rhs interleaved). > + if (!stmts.is_empty ()) > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > + stmts, roots, remain, > + max_tree_size, &limit, > + bst_map, NULL, force_single_lane); > + } > + > + /* Find and create slp instances for inductions that have been forced > + live due to early break. */ > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo)) > + { > + vec<stmt_vec_info> stmts; > + vec<stmt_vec_info> roots = vNULL; > + vec<tree> remain = vNULL; > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); Hmm, so this isn't the LC PHI but the header PHI! So maybe better named LOOP_VINFO_EARLY_BREAKS_LIVE_IVS? Note we're now walking all actual LC PHIs, but only when marked as live. So maybe you can instead of adding to LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS mark the defs appropriately? Just a suggestion - I think what you do works as well. > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > + stmts.create (1); > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > + stmts, roots, remain, > + max_tree_size, &limit, > + bst_map, NULL, force_single_lane); > + } > } > > hash_set<slp_tree> visited_patterns; > @@ -7157,8 +7220,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > } > } > } > - /* No def means this is a loo_vect sink. */ > - if (!any_def) > + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a > + def but shouldn't be considered sinks. */ > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > @@ -7542,9 +7606,27 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, > return true; > visited_vec.safe_push (node); > > + /* If early break also check the root statement as we need to both analyze > + and trigger codegen for it. The analysis will check whether can actually > + vectorize it. At the memoment splitting off the analsysi bit from inside > + it duplicates a lot of the setup code so it's not worth while to do so. > + However when either the non-SLP loop vect goes away or we split vectorizable_* > + functions then we can call the analysis only part from here instead. */ > bool res = true; > - unsigned visited_rec_start = visited_vec.length (); > unsigned cost_vec_rec_start = cost_vec->length (); > + if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_gcond) > + { > + auto root_stmt_info = SLP_INSTANCE_ROOT_STMTS (node_instance)[0]; > + res = vectorizable_early_exit (vinfo, root_stmt_info, NULL, NULL, NULL, > + cost_vec); Can you do this here instead: bool vect_slp_analyze_operations (vec_info *vinfo) { ... /* Check we can vectorize the reduction. */ || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) as we're checking the roots for slp_inst_kind_ctor and slp_inst_kind_bb_reduc here already. > + if (!res) > + { > + cost_vec->truncate (cost_vec_rec_start); > + return res; > + } > + } > + > + unsigned visited_rec_start = visited_vec.length (); > bool seen_non_constant_child = false; > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) > { > @@ -8612,6 +8694,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > !gsi_end_p (gsi); gsi_next (&gsi)) > { > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > + /* This can be used to start SLP discovery for early breaks for BB early breaks > + when we get that far. */ > if (!assign) > continue; > > @@ -10758,7 +10842,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, slp_tree node) > /* Vectorize the instance root. */ > > void > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance instance) > { > gassign *rstmt = NULL; > > @@ -10862,6 +10946,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > update_stmt (gsi_stmt (rgsi)); > return; > } > + else if (instance->kind == slp_inst_kind_gcond) > + { > + /* Only support a single root for now as we can't codegen CFG yet and so we > + can't support lane > 1 at this time. */ > + gcc_assert (instance->root_stmts.length () == 1); > + auto root_stmt_info = instance->root_stmts[0]; > + auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt; > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); So last_stmt is actually the gcond (and thus root_stmt_info)? Why not use that for the rgsi directly? > + gimple *vec_stmt = NULL; > + gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0); I guess you'd want to assert SLP_TREE_VEC_DEFS is not empty? > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > + &vec_stmt, node, NULL); > + gcc_assert (res); > + return; > + } > else > gcc_unreachable (); > > @@ -11080,7 +11179,7 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances) > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > - vectorize_slp_instance_root_stmt (node, instance); > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index b72b54d666879d8485f8d972b4e8d9dc64bc86b3..8f3f35989879199ffd0eb24729cb7ade856a3c4d 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo, > dump_printf_loc (MSG_NOTE, vect_location, > "vec_stmt_relevant_p: induction forced for " > "early break.\n"); > + LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo).safe_push (stmt_info); > *live_p = true; ah, so it's even live already - what's the relevancy used? See above for the SLP discovery question regarding to that we already walk all actual LC PHI defs. Otherwise looks OK - thanks for the work! Richard. > > } > @@ -12933,7 +12934,7 @@ vectorizable_comparison (vec_info *vinfo, > /* Check to see if the current early break given in STMT_INFO is valid for > vectorization. */ > > -static bool > +bool > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > gimple_stmt_iterator *gsi, gimple **vec_stmt, > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > @@ -12958,7 +12959,7 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > tree op0; > enum vect_def_type dt0; > if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > - &vectype)) > + &vectype)) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -12966,6 +12967,13 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > return false; > } > > + /* For SLP we don't want to use the type of the operands of the SLP node, when > + vectorizing using SLP slp_node will be the children of the gcond and we want to > + use the type of the direct children which since the gcond is root will be the > + current node, rather than a child node as vect_is_simple_use assumes. */ > + if (slp_node) > + vectype = SLP_TREE_VECTYPE (slp_node); > + > if (!vectype) > return false; > > @@ -13060,9 +13068,18 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > - vec_stmt, slp_node, cost_vec)) > - gcc_unreachable (); > + /* For SLP we don't do codegen of the body starting from the gcond, the gconds are > + roots and so by the time we get to them we have already codegened the SLP tree > + and so we shouldn't try to do so again. The arguments have already been > + vectorized. It's not very clean to do this here, But the masking code below is > + complex and this keeps it all in one place to ease fixes and backports. Once we > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ > + if (!slp_node) > + { > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > + vec_stmt, slp_node, cost_vec)) > + gcc_unreachable (); > + } > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > basic_block cond_bb = gimple_bb (stmt); > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index 490061aea2f6d465d9589eb97bbd34a920d76b1c..53483303c4ac3482760fe722354f602e0243e5a2 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -252,7 +252,8 @@ enum slp_instance_kind { > slp_inst_kind_reduc_group, > slp_inst_kind_reduc_chain, > slp_inst_kind_bb_reduc, > - slp_inst_kind_ctor > + slp_inst_kind_ctor, > + slp_inst_kind_gcond > }; > > /* SLP instance is a sequence of stmts in a loop that can be packed into > @@ -977,6 +978,10 @@ public: > /* Statements whose VUSES need updating if early break vectorization is to > happen. */ > auto_vec<gimple*> early_break_vuses; > + > + /* Record statements that are needed to be live for early break vectorization > + but may not have an LC PHI node materialized yet in the exits. */ > + auto_vec<stmt_vec_info> early_break_live_stmts; > } *loop_vec_info; > > /* Access Functions. */ > @@ -1036,6 +1041,8 @@ public: > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS(L) \ > + (L)->early_break_live_stmts > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > @@ -2522,6 +2529,9 @@ extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree, > stmt_vector_for_cost *); > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > gimple **, slp_tree, stmt_vector_for_cost *); > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > + gimple_stmt_iterator *, gimple **, > + slp_tree, stmt_vector_for_cost *); > extern bool vect_emulated_vector_p (tree); > extern bool vect_can_vectorize_without_simd_p (tree_code); > extern bool vect_can_vectorize_without_simd_p (code_helper); > > > > >
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Wednesday, October 2, 2024 1:50 PM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com > Subject: Re: [PATCH]middle-end: support SLP early break > > On Tue, 1 Oct 2024, Tamar Christina wrote: > > > Hi all, > > > > This patch introduces feature parity for early break int the SLP only > > vectorizer. > > > > The approach taken here is to treat the early exits as root statements for an > > SLP tree. This means that we don't need any changes to build_slp to support > > gconds. > > > > Codegen for the gcond itself now has to be done out of line but the body of the > > SLP blocks itself is simply driven by SLP scheduling. There is a slight > > awkwardness in having re-used vectorizable_early_exit for both SLP and non-SLP > > but I've documented the differences and when I did try to refactor it it wasn't > > really worth it given that this is a temporary state anyway. > > > > This version is restricted to lane = 1, as such we can re-use the existing > > move_early_break function instead of having to do safety update through > > scheduling. I have a branch where I'm working on that but lane > 1 is out of > > scope for GCC 15 anyway. The only reason I will try to get moving through > > scheduling done as a stretch goal is so we get epilogue vectorization back for > > early break. > > > > The example: > > > > unsigned test4(unsigned x) > > { > > unsigned ret = 0; > > for (int i = 0; i < N; i++) > > { > > vect_b[i] = x + i; > > if (vect_a[i]*2 != x) > > break; > > vect_a[i] = x; > > > > } > > return ret; > > } > > > > builds the following SLP instance for early break: > > > > note: Analyzing vectorizable control flow: if (patt_6 != 0) > > note: Starting SLP discovery for > > note: patt_6 = _4 != x_9(D); > > note: starting SLP discovery for node 0x63abc80 > > note: Build SLP for patt_6 = _4 != x_9(D); > > note: precomputed vectype: vector(4) <signed-boolean:32> > > note: nunits = 4 > > note: vect_is_simple_use: operand x_9(D), type of def: external > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] > MASK 0xffff > > _3 * 2, type of def: internal > > note: starting SLP discovery for node 0x63abdc0 > > note: Build SLP for _4 = _3 * 2; > > note: precomputed vectype: vector(4) unsigned int > > note: nunits = 4 > > note: vect_is_simple_use: operand # > > vect_aD.4416[i_15], type of def: internal > > note: vect_is_simple_use: operand 2, type of def: constant > > note: starting SLP discovery for node 0x63abe60 > > note: Build SLP for _3 = vect_a[i_15]; > > note: precomputed vectype: vector(4) unsigned int > > note: nunits = 4 > > note: SLP discovery for node 0x63abe60 succeeded > > note: SLP discovery for node 0x63abdc0 succeeded > > note: SLP discovery for node 0x63abc80 succeeded > > note: SLP size 3 vs. limit 10. > > note: Final SLP tree for instance 0x6474190: > > note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32> > > note: op template: patt_6 = _4 != x_9(D); > > note: stmt 0 patt_6 = _4 != x_9(D); > > note: children 0x63abd20 0x63abdc0 > > note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) > > note: { x_9(D) } > > note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int > > note: op template: _4 = _3 * 2; > > note: stmt 0 _4 = _3 * 2; > > note: children 0x63abe60 0x63abf00 > > note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int > > note: op template: _3 = vect_a[i_15]; > > note: stmt 0 _3 = vect_a[i_15]; > > note: load permutation { 0 } > > note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) > > note: { 2 } > > > > and during codegen: > > > > note: ------>vectorizing SLP node starting from: patt_6 = _4 != x_9(D); > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] > MASK 0xffff > > _3 * 2, type of def: internal > > note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; > > note: === vectorizable_early_exit === > > note: transform early-exit. > > note: vectorizing stmts using SLP. > > note: Vectorizing SLP tree: > > note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int > > note: op template: i_12 = i_15 + 1; > > note: stmt 0 i_12 = i_15 + 1; > > note: children 0x63aba00 0x63ac040 > > note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int > > note: op template: i_15 = PHI <i_12(6), 0(14)> > > note: [l] stmt 0 i_15 = PHI <i_12(6), 0(14)> > > note: children (nil) (nil) > > note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int > > note: { 1 } > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf, > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > Also bootstrapped --with-build-config='bootstrap-O3 bootstrap-lto' > > --enable-checking=release,yes,rtl,extra on aarch64-none-linux-gnu and > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > Ok for master? > > > > gcc/ChangeLog: > > > > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > > (LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS): New. > > (vectorizable_early_exit): Expose. > > (class _loop_vec_info): Add early_break_live_stmts. > > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > > Support gcond instances. > > (vect_analyze_slp): Analyze gcond roots and early break live statements. > > (maybe_push_to_hybrid_worklist): Don't sink gconds. > > (vect_slp_analyze_node_operations): Support gconds. > > (vect_slp_check_for_roots): Update comments. > > (vectorize_slp_instance_root_stmt): Support gconds. > > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > > statements. > > (vectorizable_early_exit): Support SLP. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/vect/vect-early-break_126.c: New test. > > * gcc.dg/vect/vect-early-break_127.c: New test. > > * gcc.dg/vect/vect-early-break_128.c: New test. > > > > --- > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123 > ff509d13be17ffacf > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > @@ -0,0 +1,28 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#define N 1024 > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i++) > > + { > > + vect_b[i] = x + i; > > + if (vect_a[i] > x) > > + { > > + ret *= vect_a[i]; > > + return vect_a[i]; > > + } > > + vect_a[i] = x; > > + ret += vect_a[i] + vect_b[i]; > > + } > > + return ret; > > +} > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72 > c35df8e83535ef184ab > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > @@ -0,0 +1,27 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#ifndef N > > +#define N 800 > > +#endif > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i++) > > + { > > + vect_b[i] = x + i; > > + if (vect_a[i]*2 != x) > > + break; > > + vect_a[i] = x; > > + > > + } > > + return ret; > > +} > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1d > e2c4a04286989204fd > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > @@ -0,0 +1,31 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#ifndef N > > +#define N 800 > > +#endif > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i+=2) > > + { > > + vect_b[i] = x + i; > > + vect_b[i+1] = x + i+1; > > + if (vect_a[i]*2 != x) > > + break; > > + if (vect_a[i+1]*2 != x) > > + break; > > + vect_a[i] = x; > > + vect_a[i+1] = x; > > + > > + } > > + return ret; > > +} > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index > 600987dd6e5d506aa5fbb02350f9dab77793d382..7e765df466a59249feb999c > 24d8f2dad232948ae 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -3697,6 +3697,13 @@ vect_build_slp_instance (vec_info *vinfo, > > "Analyzing vectorizable constructor: %G\n", > > root_stmt_infos[0]->stmt); > > } > > + else if (kind == slp_inst_kind_gcond) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "Analyzing vectorizable control flow: %G", > > + root_stmt_infos[0]->stmt); > > + } > > > > if (dump_enabled_p ()) > > { > > @@ -4143,6 +4150,12 @@ vect_analyze_slp_instance (vec_info *vinfo, > > STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) > > = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ())); > > } > > + else if (kind == slp_inst_kind_gcond) > > + { > > + /* Collect the stores and store them in scalar_stmts. */ > > + scalar_stmts.create (1); > > + scalar_stmts.quick_push (vect_stmt_to_vectorize (next_info)); > > + } > > We have this "left-over" dual API but since you never call > vect_analyze_slp_instance with slp_inst_kind_gcond but use > vect_build_slp_instance I think you can drop this hunk. > True, I missed dropping it during cleanup. > > else > > gcc_unreachable (); > > > > @@ -4742,6 +4755,56 @@ vect_analyze_slp (vec_info *vinfo, unsigned > max_tree_size, > > bst_map, NULL, force_single_lane); > > } > > } > > + > > + /* Find SLP sequences starting from gconds. */ > > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > > + { > > + auto cond_info = loop_vinfo->lookup_stmt (cond); > > + vec<stmt_vec_info> stmts; > > + vec<stmt_vec_info> roots = vNULL; > > + vec<tree> remain = vNULL; > > + > > + cond_info = vect_stmt_to_vectorize (cond_info); > > + roots.safe_push (cond_info); > > + stmts.create (2); > > + tree args0 = gimple_cond_lhs (STMT_VINFO_STMT (cond_info)); > > + tree args1 = gimple_cond_rhs (STMT_VINFO_STMT (cond_info)); > > + /* An argument without a loop def will be codegened from vectorizing > the > > + root gcond itself. As such we don't need to try to build an SLP tree > > + from them. It's highly likely that the resulting SLP tree here if both > > + arguments have a def will be incompatible, but we rely on it being split > > + later on. */ > > + if (auto varg = loop_vinfo->lookup_def (args0)) > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > + > > + if (auto varg = loop_vinfo->lookup_def (args1)) > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > Hmm - you pattern-replace > > if (a != b) > > with > > patt = a != b; > if (patt != 0) > > correct? So I don't get why you push two scalar stmts in 'stmts' > and not just the pattern def of args0? Plus assert the condition > is NE_EXPR and the cond_rhs zero? (or simply guard discovery with > that - other gconds will simply not be vectorized) > > That said, when loop_vinfo->lookup_def (args1) is not NULL > what you ask for doesn't really make sense - even if it discovers > OK (you get vectors with the condition lhs and rhs interleaved). > > > + if (!stmts.is_empty ()) > > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > > + stmts, roots, remain, > > + max_tree_size, &limit, > > + bst_map, NULL, force_single_lane); > > + } > > + > > + /* Find and create slp instances for inductions that have been forced > > + live due to early break. */ > > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS > (loop_vinfo)) > > + { > > + vec<stmt_vec_info> stmts; > > + vec<stmt_vec_info> roots = vNULL; > > + vec<tree> remain = vNULL; > > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); > > Hmm, so this isn't the LC PHI but the header PHI! So maybe better named > LOOP_VINFO_EARLY_BREAKS_LIVE_IVS? > > Note we're now walking all actual LC PHIs, but only when marked as > live. So maybe you can instead of adding to > LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS mark the defs appropriately? > Just a suggestion - I think what you do works as well. > > > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > > + stmts.create (1); > > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > > + stmts, roots, remain, > > + max_tree_size, &limit, > > + bst_map, NULL, force_single_lane); > > + } > > } > > > > hash_set<slp_tree> visited_patterns; > > @@ -7157,8 +7220,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > > } > > } > > } > > - /* No def means this is a loo_vect sink. */ > > - if (!any_def) > > + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a > > + def but shouldn't be considered sinks. */ > > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > @@ -7542,9 +7606,27 @@ vect_slp_analyze_node_operations (vec_info *vinfo, > slp_tree node, > > return true; > > visited_vec.safe_push (node); > > > > + /* If early break also check the root statement as we need to both analyze > > + and trigger codegen for it. The analysis will check whether can actually > > + vectorize it. At the memoment splitting off the analsysi bit from inside > > + it duplicates a lot of the setup code so it's not worth while to do so. > > + However when either the non-SLP loop vect goes away or we split > vectorizable_* > > + functions then we can call the analysis only part from here instead. */ > > bool res = true; > > - unsigned visited_rec_start = visited_vec.length (); > > unsigned cost_vec_rec_start = cost_vec->length (); > > + if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_gcond) > > + { > > + auto root_stmt_info = SLP_INSTANCE_ROOT_STMTS (node_instance)[0]; > > + res = vectorizable_early_exit (vinfo, root_stmt_info, NULL, NULL, NULL, > > + cost_vec); > > Can you do this here instead: > > bool > vect_slp_analyze_operations (vec_info *vinfo) > { > ... > /* Check we can vectorize the reduction. */ > || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc > && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) > > as we're checking the roots for slp_inst_kind_ctor and > slp_inst_kind_bb_reduc here already. > > > + if (!res) > > + { > > + cost_vec->truncate (cost_vec_rec_start); > > + return res; > > + } > > + } > > + > > + unsigned visited_rec_start = visited_vec.length (); > > bool seen_non_constant_child = false; > > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) > > { > > @@ -8612,6 +8694,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > > !gsi_end_p (gsi); gsi_next (&gsi)) > > { > > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > > + /* This can be used to start SLP discovery for early breaks for BB early breaks > > + when we get that far. */ > > if (!assign) > > continue; > > > > @@ -10758,7 +10842,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, > slp_tree node) > > /* Vectorize the instance root. */ > > > > void > > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance > instance) > > { > > gassign *rstmt = NULL; > > > > @@ -10862,6 +10946,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, > slp_instance instance) > > update_stmt (gsi_stmt (rgsi)); > > return; > > } > > + else if (instance->kind == slp_inst_kind_gcond) > > + { > > + /* Only support a single root for now as we can't codegen CFG yet and so we > > + can't support lane > 1 at this time. */ > > + gcc_assert (instance->root_stmts.length () == 1); > > + auto root_stmt_info = instance->root_stmts[0]; > > + auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt; > > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); > > So last_stmt is actually the gcond (and thus root_stmt_info)? Why > not use that for the rgsi directly? > > > + gimple *vec_stmt = NULL; > > + gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0); > > I guess you'd want to assert SLP_TREE_VEC_DEFS is not empty? > > > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > > + &vec_stmt, node, NULL); > > + gcc_assert (res); > > + return; > > + } > > else > > gcc_unreachable (); > > > > @@ -11080,7 +11179,7 @@ vect_schedule_slp (vec_info *vinfo, const > vec<slp_instance> &slp_instances) > > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > > - vectorize_slp_instance_root_stmt (node, instance); > > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > index > b72b54d666879d8485f8d972b4e8d9dc64bc86b3..8f3f35989879199ffd0eb24 > 729cb7ade856a3c4d 100644 > > --- a/gcc/tree-vect-stmts.cc > > +++ b/gcc/tree-vect-stmts.cc > > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, > loop_vec_info loop_vinfo, > > dump_printf_loc (MSG_NOTE, vect_location, > > "vec_stmt_relevant_p: induction forced for " > > "early break.\n"); > > + LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo).safe_push > (stmt_info); > > *live_p = true; > > ah, so it's even live already - what's the relevancy used? See above > for the SLP discovery question regarding to that we already walk all > actual LC PHI defs. The issue is that we're early. We force the PHIs live, but their loop based usages haven't been materialized yet, they will be by peeling. This leads to the situation that Hybrid detection sees them as live, but does not see a vect use of them. To avoid this we have to manually construct the SLP tree since we need to codegen them anyway. Which has two benefits 1. it makes hybrid detection see the PHIs as used in loop 2. It makes it possible to re-order them during scheduling to avoid the scheduling dependency issue where the old value gets copied in every loop iteration because the IV increase happens before the loop latch where the last use of the old IV is. > > Otherwise looks OK - thanks for the work! > Was aiming for a single review patch, but fell short :) Will get to respinning Cheers, Tamar > Richard. > > > > > } > > @@ -12933,7 +12934,7 @@ vectorizable_comparison (vec_info *vinfo, > > /* Check to see if the current early break given in STMT_INFO is valid for > > vectorization. */ > > > > -static bool > > +bool > > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > > gimple_stmt_iterator *gsi, gimple **vec_stmt, > > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > > @@ -12958,7 +12959,7 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > tree op0; > > enum vect_def_type dt0; > > if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > > - &vectype)) > > + &vectype)) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > @@ -12966,6 +12967,13 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > return false; > > } > > > > + /* For SLP we don't want to use the type of the operands of the SLP node, > when > > + vectorizing using SLP slp_node will be the children of the gcond and we want > to > > + use the type of the direct children which since the gcond is root will be the > > + current node, rather than a child node as vect_is_simple_use assumes. */ > > + if (slp_node) > > + vectype = SLP_TREE_VECTYPE (slp_node); > > + > > if (!vectype) > > return false; > > > > @@ -13060,9 +13068,18 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > - vec_stmt, slp_node, cost_vec)) > > - gcc_unreachable (); > > + /* For SLP we don't do codegen of the body starting from the gcond, the > gconds are > > + roots and so by the time we get to them we have already codegened the SLP > tree > > + and so we shouldn't try to do so again. The arguments have already been > > + vectorized. It's not very clean to do this here, But the masking code below is > > + complex and this keeps it all in one place to ease fixes and backports. Once > we > > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ > > + if (!slp_node) > > + { > > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > + vec_stmt, slp_node, cost_vec)) > > + gcc_unreachable (); > > + } > > > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > > basic_block cond_bb = gimple_bb (stmt); > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > index > 490061aea2f6d465d9589eb97bbd34a920d76b1c..53483303c4ac3482760fe72 > 2354f602e0243e5a2 100644 > > --- a/gcc/tree-vectorizer.h > > +++ b/gcc/tree-vectorizer.h > > @@ -252,7 +252,8 @@ enum slp_instance_kind { > > slp_inst_kind_reduc_group, > > slp_inst_kind_reduc_chain, > > slp_inst_kind_bb_reduc, > > - slp_inst_kind_ctor > > + slp_inst_kind_ctor, > > + slp_inst_kind_gcond > > }; > > > > /* SLP instance is a sequence of stmts in a loop that can be packed into > > @@ -977,6 +978,10 @@ public: > > /* Statements whose VUSES need updating if early break vectorization is to > > happen. */ > > auto_vec<gimple*> early_break_vuses; > > + > > + /* Record statements that are needed to be live for early break vectorization > > + but may not have an LC PHI node materialized yet in the exits. */ > > + auto_vec<stmt_vec_info> early_break_live_stmts; > > } *loop_vec_info; > > > > /* Access Functions. */ > > @@ -1036,6 +1041,8 @@ public: > > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS(L) \ > > + (L)->early_break_live_stmts > > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > > @@ -2522,6 +2529,9 @@ extern bool vectorizable_phi (vec_info *, > stmt_vec_info, gimple **, slp_tree, > > stmt_vector_for_cost *); > > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > > gimple **, slp_tree, stmt_vector_for_cost *); > > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > > + gimple_stmt_iterator *, gimple **, > > + slp_tree, stmt_vector_for_cost *); > > extern bool vect_emulated_vector_p (tree); > > extern bool vect_can_vectorize_without_simd_p (tree_code); > > extern bool vect_can_vectorize_without_simd_p (code_helper); > > > > > > > > > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Wednesday, October 2, 2024 1:50 PM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com > Subject: Re: [PATCH]middle-end: support SLP early break > > On Tue, 1 Oct 2024, Tamar Christina wrote: > > > Hi all, > > > > This patch introduces feature parity for early break int the SLP only > > vectorizer. > > > > The approach taken here is to treat the early exits as root statements for an > > SLP tree. This means that we don't need any changes to build_slp to support > > gconds. > > > > Codegen for the gcond itself now has to be done out of line but the body of the > > SLP blocks itself is simply driven by SLP scheduling. There is a slight > > awkwardness in having re-used vectorizable_early_exit for both SLP and non-SLP > > but I've documented the differences and when I did try to refactor it it wasn't > > really worth it given that this is a temporary state anyway. > > > > This version is restricted to lane = 1, as such we can re-use the existing > > move_early_break function instead of having to do safety update through > > scheduling. I have a branch where I'm working on that but lane > 1 is out of > > scope for GCC 15 anyway. The only reason I will try to get moving through > > scheduling done as a stretch goal is so we get epilogue vectorization back for > > early break. > > > > The example: > > > > unsigned test4(unsigned x) > > { > > unsigned ret = 0; > > for (int i = 0; i < N; i++) > > { > > vect_b[i] = x + i; > > if (vect_a[i]*2 != x) > > break; > > vect_a[i] = x; > > > > } > > return ret; > > } > > > > builds the following SLP instance for early break: > > > > note: Analyzing vectorizable control flow: if (patt_6 != 0) > > note: Starting SLP discovery for > > note: patt_6 = _4 != x_9(D); > > note: starting SLP discovery for node 0x63abc80 > > note: Build SLP for patt_6 = _4 != x_9(D); > > note: precomputed vectype: vector(4) <signed-boolean:32> > > note: nunits = 4 > > note: vect_is_simple_use: operand x_9(D), type of def: external > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] > MASK 0xffff > > _3 * 2, type of def: internal > > note: starting SLP discovery for node 0x63abdc0 > > note: Build SLP for _4 = _3 * 2; > > note: precomputed vectype: vector(4) unsigned int > > note: nunits = 4 > > note: vect_is_simple_use: operand # > > vect_aD.4416[i_15], type of def: internal > > note: vect_is_simple_use: operand 2, type of def: constant > > note: starting SLP discovery for node 0x63abe60 > > note: Build SLP for _3 = vect_a[i_15]; > > note: precomputed vectype: vector(4) unsigned int > > note: nunits = 4 > > note: SLP discovery for node 0x63abe60 succeeded > > note: SLP discovery for node 0x63abdc0 succeeded > > note: SLP discovery for node 0x63abc80 succeeded > > note: SLP size 3 vs. limit 10. > > note: Final SLP tree for instance 0x6474190: > > note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32> > > note: op template: patt_6 = _4 != x_9(D); > > note: stmt 0 patt_6 = _4 != x_9(D); > > note: children 0x63abd20 0x63abdc0 > > note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) > > note: { x_9(D) } > > note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int > > note: op template: _4 = _3 * 2; > > note: stmt 0 _4 = _3 * 2; > > note: children 0x63abe60 0x63abf00 > > note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int > > note: op template: _3 = vect_a[i_15]; > > note: stmt 0 _3 = vect_a[i_15]; > > note: load permutation { 0 } > > note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) > > note: { 2 } > > > > and during codegen: > > > > note: ------>vectorizing SLP node starting from: patt_6 = _4 != x_9(D); > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] > MASK 0xffff > > _3 * 2, type of def: internal > > note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; > > note: === vectorizable_early_exit === > > note: transform early-exit. > > note: vectorizing stmts using SLP. > > note: Vectorizing SLP tree: > > note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int > > note: op template: i_12 = i_15 + 1; > > note: stmt 0 i_12 = i_15 + 1; > > note: children 0x63aba00 0x63ac040 > > note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int > > note: op template: i_15 = PHI <i_12(6), 0(14)> > > note: [l] stmt 0 i_15 = PHI <i_12(6), 0(14)> > > note: children (nil) (nil) > > note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int > > note: { 1 } > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf, > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > Also bootstrapped --with-build-config='bootstrap-O3 bootstrap-lto' > > --enable-checking=release,yes,rtl,extra on aarch64-none-linux-gnu and > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > Ok for master? > > > > gcc/ChangeLog: > > > > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > > (LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS): New. > > (vectorizable_early_exit): Expose. > > (class _loop_vec_info): Add early_break_live_stmts. > > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > > Support gcond instances. > > (vect_analyze_slp): Analyze gcond roots and early break live statements. > > (maybe_push_to_hybrid_worklist): Don't sink gconds. > > (vect_slp_analyze_node_operations): Support gconds. > > (vect_slp_check_for_roots): Update comments. > > (vectorize_slp_instance_root_stmt): Support gconds. > > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > > statements. > > (vectorizable_early_exit): Support SLP. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/vect/vect-early-break_126.c: New test. > > * gcc.dg/vect/vect-early-break_127.c: New test. > > * gcc.dg/vect/vect-early-break_128.c: New test. > > > > --- > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123 > ff509d13be17ffacf > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > @@ -0,0 +1,28 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#define N 1024 > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i++) > > + { > > + vect_b[i] = x + i; > > + if (vect_a[i] > x) > > + { > > + ret *= vect_a[i]; > > + return vect_a[i]; > > + } > > + vect_a[i] = x; > > + ret += vect_a[i] + vect_b[i]; > > + } > > + return ret; > > +} > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72 > c35df8e83535ef184ab > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > @@ -0,0 +1,27 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#ifndef N > > +#define N 800 > > +#endif > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i++) > > + { > > + vect_b[i] = x + i; > > + if (vect_a[i]*2 != x) > > + break; > > + vect_a[i] = x; > > + > > + } > > + return ret; > > +} > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1d > e2c4a04286989204fd > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > @@ -0,0 +1,31 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#ifndef N > > +#define N 800 > > +#endif > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i+=2) > > + { > > + vect_b[i] = x + i; > > + vect_b[i+1] = x + i+1; > > + if (vect_a[i]*2 != x) > > + break; > > + if (vect_a[i+1]*2 != x) > > + break; > > + vect_a[i] = x; > > + vect_a[i+1] = x; > > + > > + } > > + return ret; > > +} > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index > 600987dd6e5d506aa5fbb02350f9dab77793d382..7e765df466a59249feb999c > 24d8f2dad232948ae 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -3697,6 +3697,13 @@ vect_build_slp_instance (vec_info *vinfo, > > "Analyzing vectorizable constructor: %G\n", > > root_stmt_infos[0]->stmt); > > } > > + else if (kind == slp_inst_kind_gcond) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "Analyzing vectorizable control flow: %G", > > + root_stmt_infos[0]->stmt); > > + } > > > > if (dump_enabled_p ()) > > { > > @@ -4143,6 +4150,12 @@ vect_analyze_slp_instance (vec_info *vinfo, > > STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) > > = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ())); > > } > > + else if (kind == slp_inst_kind_gcond) > > + { > > + /* Collect the stores and store them in scalar_stmts. */ > > + scalar_stmts.create (1); > > + scalar_stmts.quick_push (vect_stmt_to_vectorize (next_info)); > > + } > > We have this "left-over" dual API but since you never call > vect_analyze_slp_instance with slp_inst_kind_gcond but use > vect_build_slp_instance I think you can drop this hunk. > > > else > > gcc_unreachable (); > > > > @@ -4742,6 +4755,56 @@ vect_analyze_slp (vec_info *vinfo, unsigned > max_tree_size, > > bst_map, NULL, force_single_lane); > > } > > } > > + > > + /* Find SLP sequences starting from gconds. */ > > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > > + { > > + auto cond_info = loop_vinfo->lookup_stmt (cond); > > + vec<stmt_vec_info> stmts; > > + vec<stmt_vec_info> roots = vNULL; > > + vec<tree> remain = vNULL; > > + > > + cond_info = vect_stmt_to_vectorize (cond_info); > > + roots.safe_push (cond_info); > > + stmts.create (2); > > + tree args0 = gimple_cond_lhs (STMT_VINFO_STMT (cond_info)); > > + tree args1 = gimple_cond_rhs (STMT_VINFO_STMT (cond_info)); > > + /* An argument without a loop def will be codegened from vectorizing > the > > + root gcond itself. As such we don't need to try to build an SLP tree > > + from them. It's highly likely that the resulting SLP tree here if both > > + arguments have a def will be incompatible, but we rely on it being split > > + later on. */ > > + if (auto varg = loop_vinfo->lookup_def (args0)) > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > + > > + if (auto varg = loop_vinfo->lookup_def (args1)) > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > Hmm - you pattern-replace > > if (a != b) > > with > > patt = a != b; > if (patt != 0) > > correct? So I don't get why you push two scalar stmts in 'stmts' > and not just the pattern def of args0? Plus assert the condition > is NE_EXPR and the cond_rhs zero? (or simply guard discovery with > that - other gconds will simply not be vectorized) Yeah, the reason it can fail is that the gcond can be entirely loop invariant and the condition already a != 0, in which case gcond lowering would have done nothing. e.g. if (a != 0) where a is loop invariant. For instance test_memcmp_1_1 in /gcc.dg/memcmp-1.c is such loop. Technically we should be able to vectorize such loops, but while we can represent externals in the SLP tree, we can't start discovery at them, as no stmt_info for them. In principle all I need here is an empty SLP tree, since all codegen is driven by the roots for such invariant compares. However vect_build_slp_tree doesn't accept empty stmts. I believe we are able to vectorize such loops today, so perhaps instead of failing we should support building an SLP instance with only roots? In which case should I try to fit it into vect_build_slp_tree or just special case it for the gcond discovery? Thanks, Tamar > > That said, when loop_vinfo->lookup_def (args1) is not NULL > what you ask for doesn't really make sense - even if it discovers > OK (you get vectors with the condition lhs and rhs interleaved). > > > + if (!stmts.is_empty ()) > > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > > + stmts, roots, remain, > > + max_tree_size, &limit, > > + bst_map, NULL, force_single_lane); > > + } > > + > > + /* Find and create slp instances for inductions that have been forced > > + live due to early break. */ > > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS > (loop_vinfo)) > > + { > > + vec<stmt_vec_info> stmts; > > + vec<stmt_vec_info> roots = vNULL; > > + vec<tree> remain = vNULL; > > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); > > Hmm, so this isn't the LC PHI but the header PHI! So maybe better named > LOOP_VINFO_EARLY_BREAKS_LIVE_IVS? > > Note we're now walking all actual LC PHIs, but only when marked as > live. So maybe you can instead of adding to > LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS mark the defs appropriately? > Just a suggestion - I think what you do works as well. > > > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > > + stmts.create (1); > > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > > + stmts, roots, remain, > > + max_tree_size, &limit, > > + bst_map, NULL, force_single_lane); > > + } > > } > > > > hash_set<slp_tree> visited_patterns; > > @@ -7157,8 +7220,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > > } > > } > > } > > - /* No def means this is a loo_vect sink. */ > > - if (!any_def) > > + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a > > + def but shouldn't be considered sinks. */ > > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > @@ -7542,9 +7606,27 @@ vect_slp_analyze_node_operations (vec_info *vinfo, > slp_tree node, > > return true; > > visited_vec.safe_push (node); > > > > + /* If early break also check the root statement as we need to both analyze > > + and trigger codegen for it. The analysis will check whether can actually > > + vectorize it. At the memoment splitting off the analsysi bit from inside > > + it duplicates a lot of the setup code so it's not worth while to do so. > > + However when either the non-SLP loop vect goes away or we split > vectorizable_* > > + functions then we can call the analysis only part from here instead. */ > > bool res = true; > > - unsigned visited_rec_start = visited_vec.length (); > > unsigned cost_vec_rec_start = cost_vec->length (); > > + if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_gcond) > > + { > > + auto root_stmt_info = SLP_INSTANCE_ROOT_STMTS (node_instance)[0]; > > + res = vectorizable_early_exit (vinfo, root_stmt_info, NULL, NULL, NULL, > > + cost_vec); > > Can you do this here instead: > > bool > vect_slp_analyze_operations (vec_info *vinfo) > { > ... > /* Check we can vectorize the reduction. */ > || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc > && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) > > as we're checking the roots for slp_inst_kind_ctor and > slp_inst_kind_bb_reduc here already. > > > + if (!res) > > + { > > + cost_vec->truncate (cost_vec_rec_start); > > + return res; > > + } > > + } > > + > > + unsigned visited_rec_start = visited_vec.length (); > > bool seen_non_constant_child = false; > > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) > > { > > @@ -8612,6 +8694,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > > !gsi_end_p (gsi); gsi_next (&gsi)) > > { > > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > > + /* This can be used to start SLP discovery for early breaks for BB early breaks > > + when we get that far. */ > > if (!assign) > > continue; > > > > @@ -10758,7 +10842,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, > slp_tree node) > > /* Vectorize the instance root. */ > > > > void > > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance > instance) > > { > > gassign *rstmt = NULL; > > > > @@ -10862,6 +10946,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, > slp_instance instance) > > update_stmt (gsi_stmt (rgsi)); > > return; > > } > > + else if (instance->kind == slp_inst_kind_gcond) > > + { > > + /* Only support a single root for now as we can't codegen CFG yet and so we > > + can't support lane > 1 at this time. */ > > + gcc_assert (instance->root_stmts.length () == 1); > > + auto root_stmt_info = instance->root_stmts[0]; > > + auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt; > > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); > > So last_stmt is actually the gcond (and thus root_stmt_info)? Why > not use that for the rgsi directly? > > > + gimple *vec_stmt = NULL; > > + gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0); > > I guess you'd want to assert SLP_TREE_VEC_DEFS is not empty? > > > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > > + &vec_stmt, node, NULL); > > + gcc_assert (res); > > + return; > > + } > > else > > gcc_unreachable (); > > > > @@ -11080,7 +11179,7 @@ vect_schedule_slp (vec_info *vinfo, const > vec<slp_instance> &slp_instances) > > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > > - vectorize_slp_instance_root_stmt (node, instance); > > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > index > b72b54d666879d8485f8d972b4e8d9dc64bc86b3..8f3f35989879199ffd0eb24 > 729cb7ade856a3c4d 100644 > > --- a/gcc/tree-vect-stmts.cc > > +++ b/gcc/tree-vect-stmts.cc > > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, > loop_vec_info loop_vinfo, > > dump_printf_loc (MSG_NOTE, vect_location, > > "vec_stmt_relevant_p: induction forced for " > > "early break.\n"); > > + LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo).safe_push > (stmt_info); > > *live_p = true; > > ah, so it's even live already - what's the relevancy used? See above > for the SLP discovery question regarding to that we already walk all > actual LC PHI defs. > > Otherwise looks OK - thanks for the work! > > Richard. > > > > > } > > @@ -12933,7 +12934,7 @@ vectorizable_comparison (vec_info *vinfo, > > /* Check to see if the current early break given in STMT_INFO is valid for > > vectorization. */ > > > > -static bool > > +bool > > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > > gimple_stmt_iterator *gsi, gimple **vec_stmt, > > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > > @@ -12958,7 +12959,7 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > tree op0; > > enum vect_def_type dt0; > > if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > > - &vectype)) > > + &vectype)) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > @@ -12966,6 +12967,13 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > return false; > > } > > > > + /* For SLP we don't want to use the type of the operands of the SLP node, > when > > + vectorizing using SLP slp_node will be the children of the gcond and we want > to > > + use the type of the direct children which since the gcond is root will be the > > + current node, rather than a child node as vect_is_simple_use assumes. */ > > + if (slp_node) > > + vectype = SLP_TREE_VECTYPE (slp_node); > > + > > if (!vectype) > > return false; > > > > @@ -13060,9 +13068,18 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > - vec_stmt, slp_node, cost_vec)) > > - gcc_unreachable (); > > + /* For SLP we don't do codegen of the body starting from the gcond, the > gconds are > > + roots and so by the time we get to them we have already codegened the SLP > tree > > + and so we shouldn't try to do so again. The arguments have already been > > + vectorized. It's not very clean to do this here, But the masking code below is > > + complex and this keeps it all in one place to ease fixes and backports. Once > we > > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ > > + if (!slp_node) > > + { > > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > + vec_stmt, slp_node, cost_vec)) > > + gcc_unreachable (); > > + } > > > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > > basic_block cond_bb = gimple_bb (stmt); > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > index > 490061aea2f6d465d9589eb97bbd34a920d76b1c..53483303c4ac3482760fe72 > 2354f602e0243e5a2 100644 > > --- a/gcc/tree-vectorizer.h > > +++ b/gcc/tree-vectorizer.h > > @@ -252,7 +252,8 @@ enum slp_instance_kind { > > slp_inst_kind_reduc_group, > > slp_inst_kind_reduc_chain, > > slp_inst_kind_bb_reduc, > > - slp_inst_kind_ctor > > + slp_inst_kind_ctor, > > + slp_inst_kind_gcond > > }; > > > > /* SLP instance is a sequence of stmts in a loop that can be packed into > > @@ -977,6 +978,10 @@ public: > > /* Statements whose VUSES need updating if early break vectorization is to > > happen. */ > > auto_vec<gimple*> early_break_vuses; > > + > > + /* Record statements that are needed to be live for early break vectorization > > + but may not have an LC PHI node materialized yet in the exits. */ > > + auto_vec<stmt_vec_info> early_break_live_stmts; > > } *loop_vec_info; > > > > /* Access Functions. */ > > @@ -1036,6 +1041,8 @@ public: > > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS(L) \ > > + (L)->early_break_live_stmts > > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > > @@ -2522,6 +2529,9 @@ extern bool vectorizable_phi (vec_info *, > stmt_vec_info, gimple **, slp_tree, > > stmt_vector_for_cost *); > > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > > gimple **, slp_tree, stmt_vector_for_cost *); > > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > > + gimple_stmt_iterator *, gimple **, > > + slp_tree, stmt_vector_for_cost *); > > extern bool vect_emulated_vector_p (tree); > > extern bool vect_can_vectorize_without_simd_p (tree_code); > > extern bool vect_can_vectorize_without_simd_p (code_helper); > > > > > > > > > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On Tue, 8 Oct 2024, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguenther@suse.de> > > Sent: Wednesday, October 2, 2024 1:50 PM > > To: Tamar Christina <Tamar.Christina@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com > > Subject: Re: [PATCH]middle-end: support SLP early break > > > > On Tue, 1 Oct 2024, Tamar Christina wrote: > > > > > Hi all, > > > > > > This patch introduces feature parity for early break int the SLP only > > > vectorizer. > > > > > > The approach taken here is to treat the early exits as root statements for an > > > SLP tree. This means that we don't need any changes to build_slp to support > > > gconds. > > > > > > Codegen for the gcond itself now has to be done out of line but the body of the > > > SLP blocks itself is simply driven by SLP scheduling. There is a slight > > > awkwardness in having re-used vectorizable_early_exit for both SLP and non-SLP > > > but I've documented the differences and when I did try to refactor it it wasn't > > > really worth it given that this is a temporary state anyway. > > > > > > This version is restricted to lane = 1, as such we can re-use the existing > > > move_early_break function instead of having to do safety update through > > > scheduling. I have a branch where I'm working on that but lane > 1 is out of > > > scope for GCC 15 anyway. The only reason I will try to get moving through > > > scheduling done as a stretch goal is so we get epilogue vectorization back for > > > early break. > > > > > > The example: > > > > > > unsigned test4(unsigned x) > > > { > > > unsigned ret = 0; > > > for (int i = 0; i < N; i++) > > > { > > > vect_b[i] = x + i; > > > if (vect_a[i]*2 != x) > > > break; > > > vect_a[i] = x; > > > > > > } > > > return ret; > > > } > > > > > > builds the following SLP instance for early break: > > > > > > note: Analyzing vectorizable control flow: if (patt_6 != 0) > > > note: Starting SLP discovery for > > > note: patt_6 = _4 != x_9(D); > > > note: starting SLP discovery for node 0x63abc80 > > > note: Build SLP for patt_6 = _4 != x_9(D); > > > note: precomputed vectype: vector(4) <signed-boolean:32> > > > note: nunits = 4 > > > note: vect_is_simple_use: operand x_9(D), type of def: external > > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] > > MASK 0xffff > > > _3 * 2, type of def: internal > > > note: starting SLP discovery for node 0x63abdc0 > > > note: Build SLP for _4 = _3 * 2; > > > note: precomputed vectype: vector(4) unsigned int > > > note: nunits = 4 > > > note: vect_is_simple_use: operand # > > > vect_aD.4416[i_15], type of def: internal > > > note: vect_is_simple_use: operand 2, type of def: constant > > > note: starting SLP discovery for node 0x63abe60 > > > note: Build SLP for _3 = vect_a[i_15]; > > > note: precomputed vectype: vector(4) unsigned int > > > note: nunits = 4 > > > note: SLP discovery for node 0x63abe60 succeeded > > > note: SLP discovery for node 0x63abdc0 succeeded > > > note: SLP discovery for node 0x63abc80 succeeded > > > note: SLP size 3 vs. limit 10. > > > note: Final SLP tree for instance 0x6474190: > > > note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32> > > > note: op template: patt_6 = _4 != x_9(D); > > > note: stmt 0 patt_6 = _4 != x_9(D); > > > note: children 0x63abd20 0x63abdc0 > > > note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) > > > note: { x_9(D) } > > > note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int > > > note: op template: _4 = _3 * 2; > > > note: stmt 0 _4 = _3 * 2; > > > note: children 0x63abe60 0x63abf00 > > > note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int > > > note: op template: _3 = vect_a[i_15]; > > > note: stmt 0 _3 = vect_a[i_15]; > > > note: load permutation { 0 } > > > note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) > > > note: { 2 } > > > > > > and during codegen: > > > > > > note: ------>vectorizing SLP node starting from: patt_6 = _4 != x_9(D); > > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] > > MASK 0xffff > > > _3 * 2, type of def: internal > > > note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; > > > note: === vectorizable_early_exit === > > > note: transform early-exit. > > > note: vectorizing stmts using SLP. > > > note: Vectorizing SLP tree: > > > note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int > > > note: op template: i_12 = i_15 + 1; > > > note: stmt 0 i_12 = i_15 + 1; > > > note: children 0x63aba00 0x63ac040 > > > note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int > > > note: op template: i_15 = PHI <i_12(6), 0(14)> > > > note: [l] stmt 0 i_15 = PHI <i_12(6), 0(14)> > > > note: children (nil) (nil) > > > note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int > > > note: { 1 } > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf, > > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > > > Also bootstrapped --with-build-config='bootstrap-O3 bootstrap-lto' > > > --enable-checking=release,yes,rtl,extra on aarch64-none-linux-gnu and > > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > > > Ok for master? > > > > > > gcc/ChangeLog: > > > > > > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > > > (LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS): New. > > > (vectorizable_early_exit): Expose. > > > (class _loop_vec_info): Add early_break_live_stmts. > > > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > > > Support gcond instances. > > > (vect_analyze_slp): Analyze gcond roots and early break live statements. > > > (maybe_push_to_hybrid_worklist): Don't sink gconds. > > > (vect_slp_analyze_node_operations): Support gconds. > > > (vect_slp_check_for_roots): Update comments. > > > (vectorize_slp_instance_root_stmt): Support gconds. > > > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > > > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > > > statements. > > > (vectorizable_early_exit): Support SLP. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.dg/vect/vect-early-break_126.c: New test. > > > * gcc.dg/vect/vect-early-break_127.c: New test. > > > * gcc.dg/vect/vect-early-break_128.c: New test. > > > > > > --- > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > > new file mode 100644 > > > index > > 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123 > > ff509d13be17ffacf > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > > @@ -0,0 +1,28 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-add-options vect_early_break } */ > > > +/* { dg-require-effective-target vect_early_break } */ > > > +/* { dg-require-effective-target vect_int } */ > > > + > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > > + > > > +#define N 1024 > > > +unsigned vect_a[N]; > > > +unsigned vect_b[N]; > > > + > > > +unsigned test4(unsigned x) > > > +{ > > > + unsigned ret = 0; > > > + for (int i = 0; i < N; i++) > > > + { > > > + vect_b[i] = x + i; > > > + if (vect_a[i] > x) > > > + { > > > + ret *= vect_a[i]; > > > + return vect_a[i]; > > > + } > > > + vect_a[i] = x; > > > + ret += vect_a[i] + vect_b[i]; > > > + } > > > + return ret; > > > +} > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > > new file mode 100644 > > > index > > 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72 > > c35df8e83535ef184ab > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > > @@ -0,0 +1,27 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-add-options vect_early_break } */ > > > +/* { dg-require-effective-target vect_early_break } */ > > > +/* { dg-require-effective-target vect_int } */ > > > + > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > > + > > > +#ifndef N > > > +#define N 800 > > > +#endif > > > +unsigned vect_a[N]; > > > +unsigned vect_b[N]; > > > + > > > +unsigned test4(unsigned x) > > > +{ > > > + unsigned ret = 0; > > > + for (int i = 0; i < N; i++) > > > + { > > > + vect_b[i] = x + i; > > > + if (vect_a[i]*2 != x) > > > + break; > > > + vect_a[i] = x; > > > + > > > + } > > > + return ret; > > > +} > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > > new file mode 100644 > > > index > > 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1d > > e2c4a04286989204fd > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > > @@ -0,0 +1,31 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-add-options vect_early_break } */ > > > +/* { dg-require-effective-target vect_early_break } */ > > > +/* { dg-require-effective-target vect_int } */ > > > + > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > > + > > > +#ifndef N > > > +#define N 800 > > > +#endif > > > +unsigned vect_a[N]; > > > +unsigned vect_b[N]; > > > + > > > +unsigned test4(unsigned x) > > > +{ > > > + unsigned ret = 0; > > > + for (int i = 0; i < N; i+=2) > > > + { > > > + vect_b[i] = x + i; > > > + vect_b[i+1] = x + i+1; > > > + if (vect_a[i]*2 != x) > > > + break; > > > + if (vect_a[i+1]*2 != x) > > > + break; > > > + vect_a[i] = x; > > > + vect_a[i+1] = x; > > > + > > > + } > > > + return ret; > > > +} > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > > index > > 600987dd6e5d506aa5fbb02350f9dab77793d382..7e765df466a59249feb999c > > 24d8f2dad232948ae 100644 > > > --- a/gcc/tree-vect-slp.cc > > > +++ b/gcc/tree-vect-slp.cc > > > @@ -3697,6 +3697,13 @@ vect_build_slp_instance (vec_info *vinfo, > > > "Analyzing vectorizable constructor: %G\n", > > > root_stmt_infos[0]->stmt); > > > } > > > + else if (kind == slp_inst_kind_gcond) > > > + { > > > + if (dump_enabled_p ()) > > > + dump_printf_loc (MSG_NOTE, vect_location, > > > + "Analyzing vectorizable control flow: %G", > > > + root_stmt_infos[0]->stmt); > > > + } > > > > > > if (dump_enabled_p ()) > > > { > > > @@ -4143,6 +4150,12 @@ vect_analyze_slp_instance (vec_info *vinfo, > > > STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) > > > = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ())); > > > } > > > + else if (kind == slp_inst_kind_gcond) > > > + { > > > + /* Collect the stores and store them in scalar_stmts. */ > > > + scalar_stmts.create (1); > > > + scalar_stmts.quick_push (vect_stmt_to_vectorize (next_info)); > > > + } > > > > We have this "left-over" dual API but since you never call > > vect_analyze_slp_instance with slp_inst_kind_gcond but use > > vect_build_slp_instance I think you can drop this hunk. > > > > > else > > > gcc_unreachable (); > > > > > > @@ -4742,6 +4755,56 @@ vect_analyze_slp (vec_info *vinfo, unsigned > > max_tree_size, > > > bst_map, NULL, force_single_lane); > > > } > > > } > > > + > > > + /* Find SLP sequences starting from gconds. */ > > > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > > > + { > > > + auto cond_info = loop_vinfo->lookup_stmt (cond); > > > + vec<stmt_vec_info> stmts; > > > + vec<stmt_vec_info> roots = vNULL; > > > + vec<tree> remain = vNULL; > > > + > > > + cond_info = vect_stmt_to_vectorize (cond_info); > > > + roots.safe_push (cond_info); > > > + stmts.create (2); > > > + tree args0 = gimple_cond_lhs (STMT_VINFO_STMT (cond_info)); > > > + tree args1 = gimple_cond_rhs (STMT_VINFO_STMT (cond_info)); > > > + /* An argument without a loop def will be codegened from vectorizing > > the > > > + root gcond itself. As such we don't need to try to build an SLP tree > > > + from them. It's highly likely that the resulting SLP tree here if both > > > + arguments have a def will be incompatible, but we rely on it being split > > > + later on. */ > > > + if (auto varg = loop_vinfo->lookup_def (args0)) > > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > > + > > > + if (auto varg = loop_vinfo->lookup_def (args1)) > > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > > > Hmm - you pattern-replace > > > > if (a != b) > > > > with > > > > patt = a != b; > > if (patt != 0) > > > > correct? So I don't get why you push two scalar stmts in 'stmts' > > and not just the pattern def of args0? Plus assert the condition > > is NE_EXPR and the cond_rhs zero? (or simply guard discovery with > > that - other gconds will simply not be vectorized) > > Yeah, the reason it can fail is that the gcond can be entirely loop invariant > and the condition already a != 0, in which case gcond lowering would have > done nothing. > > e.g. if (a != 0) where a is loop invariant. For instance test_memcmp_1_1 > in /gcc.dg/memcmp-1.c is such loop. Technically we should be able to > vectorize such loops, but while we can represent externals in the SLP tree, > we can't start discovery at them, as no stmt_info for them. > > In principle all I need here is an empty SLP tree, since all codegen is driven > by the roots for such invariant compares. However vect_build_slp_tree > doesn't accept empty stmts. The externals would have SLP nodes of course but the requirement currently is that the SLP instance root is an internal def. > I believe we are able to vectorize such loops today, so perhaps instead of > failing we should support building an SLP instance with only roots? It might be tempting but I don't think this is generally useful. > In which case should I try to fit it into vect_build_slp_tree or just special > case it for the gcond discovery? The issue is that you have two operands you technically would like to see code-genrated - the 'a' and the '0' vector invariants, but the SLP instance only has a single root. You could (as I suggested) simply only build the SLP node for the (invariant) LHS of the gcond, not by using vect_build_slp_tree but instead by manually building the SLP tree for the invariant - see what vect_build_slp_tree_2 does here: if (oprnd_info->first_dt == vect_external_def || oprnd_info->first_dt == vect_constant_def) { if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ()) { tree op0; tree uniform_val = op0 = oprnd_info->ops[0]; for (j = 1; j < oprnd_info->ops.length (); ++j) if (!operand_equal_p (uniform_val, oprnd_info->ops[j])) { uniform_val = NULL_TREE; break; } if (!uniform_val && !can_duplicate_and_interleave_p (vinfo, oprnd_info->ops.length (), TREE_TYPE (op0))) { matches[j] = false; if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "Build SLP failed: invalid type of def " "for variable-length SLP %T\n", op0); goto fail; } } slp_tree invnode = vect_create_new_slp_node (oprnd_info->ops); SLP_TREE_DEF_TYPE (invnode) = oprnd_info->first_dt; oprnd_info->ops = vNULL; children.safe_push (invnode); continue; It should probably work fine then? Richard. > > Thanks, > Tamar > > > > That said, when loop_vinfo->lookup_def (args1) is not NULL > > what you ask for doesn't really make sense - even if it discovers > > OK (you get vectors with the condition lhs and rhs interleaved). > > > > > + if (!stmts.is_empty ()) > > > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > > > + stmts, roots, remain, > > > + max_tree_size, &limit, > > > + bst_map, NULL, force_single_lane); > > > + } > > > + > > > + /* Find and create slp instances for inductions that have been forced > > > + live due to early break. */ > > > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > > > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS > > (loop_vinfo)) > > > + { > > > + vec<stmt_vec_info> stmts; > > > + vec<stmt_vec_info> roots = vNULL; > > > + vec<tree> remain = vNULL; > > > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > > > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); > > > > Hmm, so this isn't the LC PHI but the header PHI! So maybe better named > > LOOP_VINFO_EARLY_BREAKS_LIVE_IVS? > > > > Note we're now walking all actual LC PHIs, but only when marked as > > live. So maybe you can instead of adding to > > LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS mark the defs appropriately? > > Just a suggestion - I think what you do works as well. > > > > > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > > > + stmts.create (1); > > > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > > > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > > > + stmts, roots, remain, > > > + max_tree_size, &limit, > > > + bst_map, NULL, force_single_lane); > > > + } > > > } > > > > > > hash_set<slp_tree> visited_patterns; > > > @@ -7157,8 +7220,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > > > } > > > } > > > } > > > - /* No def means this is a loo_vect sink. */ > > > - if (!any_def) > > > + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a > > > + def but shouldn't be considered sinks. */ > > > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > > > { > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_NOTE, vect_location, > > > @@ -7542,9 +7606,27 @@ vect_slp_analyze_node_operations (vec_info *vinfo, > > slp_tree node, > > > return true; > > > visited_vec.safe_push (node); > > > > > > + /* If early break also check the root statement as we need to both analyze > > > + and trigger codegen for it. The analysis will check whether can actually > > > + vectorize it. At the memoment splitting off the analsysi bit from inside > > > + it duplicates a lot of the setup code so it's not worth while to do so. > > > + However when either the non-SLP loop vect goes away or we split > > vectorizable_* > > > + functions then we can call the analysis only part from here instead. */ > > > bool res = true; > > > - unsigned visited_rec_start = visited_vec.length (); > > > unsigned cost_vec_rec_start = cost_vec->length (); > > > + if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_gcond) > > > + { > > > + auto root_stmt_info = SLP_INSTANCE_ROOT_STMTS (node_instance)[0]; > > > + res = vectorizable_early_exit (vinfo, root_stmt_info, NULL, NULL, NULL, > > > + cost_vec); > > > > Can you do this here instead: > > > > bool > > vect_slp_analyze_operations (vec_info *vinfo) > > { > > ... > > /* Check we can vectorize the reduction. */ > > || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc > > && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) > > > > as we're checking the roots for slp_inst_kind_ctor and > > slp_inst_kind_bb_reduc here already. > > > > > + if (!res) > > > + { > > > + cost_vec->truncate (cost_vec_rec_start); > > > + return res; > > > + } > > > + } > > > + > > > + unsigned visited_rec_start = visited_vec.length (); > > > bool seen_non_constant_child = false; > > > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) > > > { > > > @@ -8612,6 +8694,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > > > !gsi_end_p (gsi); gsi_next (&gsi)) > > > { > > > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > > > + /* This can be used to start SLP discovery for early breaks for BB early breaks > > > + when we get that far. */ > > > if (!assign) > > > continue; > > > > > > @@ -10758,7 +10842,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, > > slp_tree node) > > > /* Vectorize the instance root. */ > > > > > > void > > > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > > > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance > > instance) > > > { > > > gassign *rstmt = NULL; > > > > > > @@ -10862,6 +10946,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, > > slp_instance instance) > > > update_stmt (gsi_stmt (rgsi)); > > > return; > > > } > > > + else if (instance->kind == slp_inst_kind_gcond) > > > + { > > > + /* Only support a single root for now as we can't codegen CFG yet and so we > > > + can't support lane > 1 at this time. */ > > > + gcc_assert (instance->root_stmts.length () == 1); > > > + auto root_stmt_info = instance->root_stmts[0]; > > > + auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt; > > > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); > > > > So last_stmt is actually the gcond (and thus root_stmt_info)? Why > > not use that for the rgsi directly? > > > > > + gimple *vec_stmt = NULL; > > > + gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0); > > > > I guess you'd want to assert SLP_TREE_VEC_DEFS is not empty? > > > > > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > > > + &vec_stmt, node, NULL); > > > + gcc_assert (res); > > > + return; > > > + } > > > else > > > gcc_unreachable (); > > > > > > @@ -11080,7 +11179,7 @@ vect_schedule_slp (vec_info *vinfo, const > > vec<slp_instance> &slp_instances) > > > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > > > > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > > > - vectorize_slp_instance_root_stmt (node, instance); > > > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > > > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_NOTE, vect_location, > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > > index > > b72b54d666879d8485f8d972b4e8d9dc64bc86b3..8f3f35989879199ffd0eb24 > > 729cb7ade856a3c4d 100644 > > > --- a/gcc/tree-vect-stmts.cc > > > +++ b/gcc/tree-vect-stmts.cc > > > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, > > loop_vec_info loop_vinfo, > > > dump_printf_loc (MSG_NOTE, vect_location, > > > "vec_stmt_relevant_p: induction forced for " > > > "early break.\n"); > > > + LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo).safe_push > > (stmt_info); > > > *live_p = true; > > > > ah, so it's even live already - what's the relevancy used? See above > > for the SLP discovery question regarding to that we already walk all > > actual LC PHI defs. > > > > Otherwise looks OK - thanks for the work! > > > > Richard. > > > > > > > > } > > > @@ -12933,7 +12934,7 @@ vectorizable_comparison (vec_info *vinfo, > > > /* Check to see if the current early break given in STMT_INFO is valid for > > > vectorization. */ > > > > > > -static bool > > > +bool > > > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > > > gimple_stmt_iterator *gsi, gimple **vec_stmt, > > > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > > > @@ -12958,7 +12959,7 @@ vectorizable_early_exit (vec_info *vinfo, > > stmt_vec_info stmt_info, > > > tree op0; > > > enum vect_def_type dt0; > > > if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > > > - &vectype)) > > > + &vectype)) > > > { > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > @@ -12966,6 +12967,13 @@ vectorizable_early_exit (vec_info *vinfo, > > stmt_vec_info stmt_info, > > > return false; > > > } > > > > > > + /* For SLP we don't want to use the type of the operands of the SLP node, > > when > > > + vectorizing using SLP slp_node will be the children of the gcond and we want > > to > > > + use the type of the direct children which since the gcond is root will be the > > > + current node, rather than a child node as vect_is_simple_use assumes. */ > > > + if (slp_node) > > > + vectype = SLP_TREE_VECTYPE (slp_node); > > > + > > > if (!vectype) > > > return false; > > > > > > @@ -13060,9 +13068,18 @@ vectorizable_early_exit (vec_info *vinfo, > > stmt_vec_info stmt_info, > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > > > > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > > - vec_stmt, slp_node, cost_vec)) > > > - gcc_unreachable (); > > > + /* For SLP we don't do codegen of the body starting from the gcond, the > > gconds are > > > + roots and so by the time we get to them we have already codegened the SLP > > tree > > > + and so we shouldn't try to do so again. The arguments have already been > > > + vectorized. It's not very clean to do this here, But the masking code below is > > > + complex and this keeps it all in one place to ease fixes and backports. Once > > we > > > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ > > > + if (!slp_node) > > > + { > > > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > > + vec_stmt, slp_node, cost_vec)) > > > + gcc_unreachable (); > > > + } > > > > > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > > > basic_block cond_bb = gimple_bb (stmt); > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > > index > > 490061aea2f6d465d9589eb97bbd34a920d76b1c..53483303c4ac3482760fe72 > > 2354f602e0243e5a2 100644 > > > --- a/gcc/tree-vectorizer.h > > > +++ b/gcc/tree-vectorizer.h > > > @@ -252,7 +252,8 @@ enum slp_instance_kind { > > > slp_inst_kind_reduc_group, > > > slp_inst_kind_reduc_chain, > > > slp_inst_kind_bb_reduc, > > > - slp_inst_kind_ctor > > > + slp_inst_kind_ctor, > > > + slp_inst_kind_gcond > > > }; > > > > > > /* SLP instance is a sequence of stmts in a loop that can be packed into > > > @@ -977,6 +978,10 @@ public: > > > /* Statements whose VUSES need updating if early break vectorization is to > > > happen. */ > > > auto_vec<gimple*> early_break_vuses; > > > + > > > + /* Record statements that are needed to be live for early break vectorization > > > + but may not have an LC PHI node materialized yet in the exits. */ > > > + auto_vec<stmt_vec_info> early_break_live_stmts; > > > } *loop_vec_info; > > > > > > /* Access Functions. */ > > > @@ -1036,6 +1041,8 @@ public: > > > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > > > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > > > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > > > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS(L) \ > > > + (L)->early_break_live_stmts > > > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > > > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > > > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > > > @@ -2522,6 +2529,9 @@ extern bool vectorizable_phi (vec_info *, > > stmt_vec_info, gimple **, slp_tree, > > > stmt_vector_for_cost *); > > > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > > > gimple **, slp_tree, stmt_vector_for_cost *); > > > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > > > + gimple_stmt_iterator *, gimple **, > > > + slp_tree, stmt_vector_for_cost *); > > > extern bool vect_emulated_vector_p (tree); > > > extern bool vect_can_vectorize_without_simd_p (tree_code); > > > extern bool vect_can_vectorize_without_simd_p (code_helper); > > > > > > > > > > > > > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) >
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Wednesday, October 9, 2024 9:20 AM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com > Subject: RE: [PATCH]middle-end: support SLP early break > > On Tue, 8 Oct 2024, Tamar Christina wrote: > > > > -----Original Message----- > > > From: Richard Biener <rguenther@suse.de> > > > Sent: Wednesday, October 2, 2024 1:50 PM > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com > > > Subject: Re: [PATCH]middle-end: support SLP early break > > > > > > On Tue, 1 Oct 2024, Tamar Christina wrote: > > > > > > > Hi all, > > > > > > > > This patch introduces feature parity for early break int the SLP only > > > > vectorizer. > > > > > > > > The approach taken here is to treat the early exits as root statements for an > > > > SLP tree. This means that we don't need any changes to build_slp to support > > > > gconds. > > > > > > > > Codegen for the gcond itself now has to be done out of line but the body of > the > > > > SLP blocks itself is simply driven by SLP scheduling. There is a slight > > > > awkwardness in having re-used vectorizable_early_exit for both SLP and non- > SLP > > > > but I've documented the differences and when I did try to refactor it it wasn't > > > > really worth it given that this is a temporary state anyway. > > > > > > > > This version is restricted to lane = 1, as such we can re-use the existing > > > > move_early_break function instead of having to do safety update through > > > > scheduling. I have a branch where I'm working on that but lane > 1 is out of > > > > scope for GCC 15 anyway. The only reason I will try to get moving through > > > > scheduling done as a stretch goal is so we get epilogue vectorization back for > > > > early break. > > > > > > > > The example: > > > > > > > > unsigned test4(unsigned x) > > > > { > > > > unsigned ret = 0; > > > > for (int i = 0; i < N; i++) > > > > { > > > > vect_b[i] = x + i; > > > > if (vect_a[i]*2 != x) > > > > break; > > > > vect_a[i] = x; > > > > > > > > } > > > > return ret; > > > > } > > > > > > > > builds the following SLP instance for early break: > > > > > > > > note: Analyzing vectorizable control flow: if (patt_6 != 0) > > > > note: Starting SLP discovery for > > > > note: patt_6 = _4 != x_9(D); > > > > note: starting SLP discovery for node 0x63abc80 > > > > note: Build SLP for patt_6 = _4 != x_9(D); > > > > note: precomputed vectype: vector(4) <signed-boolean:32> > > > > note: nunits = 4 > > > > note: vect_is_simple_use: operand x_9(D), type of def: external > > > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, > +INF] > > > MASK 0xffff > > > > _3 * 2, type of def: internal > > > > note: starting SLP discovery for node 0x63abdc0 > > > > note: Build SLP for _4 = _3 * 2; > > > > note: precomputed vectype: vector(4) unsigned int > > > > note: nunits = 4 > > > > note: vect_is_simple_use: operand # > > > > vect_aD.4416[i_15], type of def: internal > > > > note: vect_is_simple_use: operand 2, type of def: constant > > > > note: starting SLP discovery for node 0x63abe60 > > > > note: Build SLP for _3 = vect_a[i_15]; > > > > note: precomputed vectype: vector(4) unsigned int > > > > note: nunits = 4 > > > > note: SLP discovery for node 0x63abe60 succeeded > > > > note: SLP discovery for node 0x63abdc0 succeeded > > > > note: SLP discovery for node 0x63abc80 succeeded > > > > note: SLP size 3 vs. limit 10. > > > > note: Final SLP tree for instance 0x6474190: > > > > note: node 0x63abc80 (max_nunits=4, refcnt=2) vector(4) <signed- > boolean:32> > > > > note: op template: patt_6 = _4 != x_9(D); > > > > note: stmt 0 patt_6 = _4 != x_9(D); > > > > note: children 0x63abd20 0x63abdc0 > > > > note: node (external) 0x63abd20 (max_nunits=1, refcnt=1) > > > > note: { x_9(D) } > > > > note: node 0x63abdc0 (max_nunits=4, refcnt=2) vector(4) unsigned int > > > > note: op template: _4 = _3 * 2; > > > > note: stmt 0 _4 = _3 * 2; > > > > note: children 0x63abe60 0x63abf00 > > > > note: node 0x63abe60 (max_nunits=4, refcnt=2) vector(4) unsigned int > > > > note: op template: _3 = vect_a[i_15]; > > > > note: stmt 0 _3 = vect_a[i_15]; > > > > note: load permutation { 0 } > > > > note: node (constant) 0x63abf00 (max_nunits=1, refcnt=1) > > > > note: { 2 } > > > > > > > > and during codegen: > > > > > > > > note: ------>vectorizing SLP node starting from: patt_6 = _4 != x_9(D); > > > > note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, > +INF] > > > MASK 0xffff > > > > _3 * 2, type of def: internal > > > > note: add new stmt: mask_patt_6.18_58 = _53 != vect__4.17_57; > > > > note: === vectorizable_early_exit === > > > > note: transform early-exit. > > > > note: vectorizing stmts using SLP. > > > > note: Vectorizing SLP tree: > > > > note: node 0x63abfa0 (max_nunits=4, refcnt=1) vector(4) int > > > > note: op template: i_12 = i_15 + 1; > > > > note: stmt 0 i_12 = i_15 + 1; > > > > note: children 0x63aba00 0x63ac040 > > > > note: node 0x63aba00 (max_nunits=4, refcnt=2) vector(4) int > > > > note: op template: i_15 = PHI <i_12(6), 0(14)> > > > > note: [l] stmt 0 i_15 = PHI <i_12(6), 0(14)> > > > > note: children (nil) (nil) > > > > note: node (constant) 0x63ac040 (max_nunits=1, refcnt=1) vector(4) int > > > > note: { 1 } > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux- > gnueabihf, > > > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > > > > > Also bootstrapped --with-build-config='bootstrap-O3 bootstrap-lto' > > > > --enable-checking=release,yes,rtl,extra on aarch64-none-linux-gnu and > > > > x86_64-pc-linux-gnu -m32, -m64 and no issues. > > > > > > > > Ok for master? > > > > > > > > gcc/ChangeLog: > > > > > > > > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > > > > (LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS): New. > > > > (vectorizable_early_exit): Expose. > > > > (class _loop_vec_info): Add early_break_live_stmts. > > > > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > > > > Support gcond instances. > > > > (vect_analyze_slp): Analyze gcond roots and early break live statements. > > > > (maybe_push_to_hybrid_worklist): Don't sink gconds. > > > > (vect_slp_analyze_node_operations): Support gconds. > > > > (vect_slp_check_for_roots): Update comments. > > > > (vectorize_slp_instance_root_stmt): Support gconds. > > > > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > > > > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > > > > statements. > > > > (vectorizable_early_exit): Support SLP. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > * gcc.dg/vect/vect-early-break_126.c: New test. > > > > * gcc.dg/vect/vect-early-break_127.c: New test. > > > > * gcc.dg/vect/vect-early-break_128.c: New test. > > > > > > > > --- > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > > b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > > > new file mode 100644 > > > > index > > > > 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123 > > > ff509d13be17ffacf > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > > > @@ -0,0 +1,28 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-add-options vect_early_break } */ > > > > +/* { dg-require-effective-target vect_early_break } */ > > > > +/* { dg-require-effective-target vect_int } */ > > > > + > > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > > > + > > > > +#define N 1024 > > > > +unsigned vect_a[N]; > > > > +unsigned vect_b[N]; > > > > + > > > > +unsigned test4(unsigned x) > > > > +{ > > > > + unsigned ret = 0; > > > > + for (int i = 0; i < N; i++) > > > > + { > > > > + vect_b[i] = x + i; > > > > + if (vect_a[i] > x) > > > > + { > > > > + ret *= vect_a[i]; > > > > + return vect_a[i]; > > > > + } > > > > + vect_a[i] = x; > > > > + ret += vect_a[i] + vect_b[i]; > > > > + } > > > > + return ret; > > > > +} > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > > b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > > > new file mode 100644 > > > > index > > > > 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72 > > > c35df8e83535ef184ab > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > > > @@ -0,0 +1,27 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-add-options vect_early_break } */ > > > > +/* { dg-require-effective-target vect_early_break } */ > > > > +/* { dg-require-effective-target vect_int } */ > > > > + > > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > > > + > > > > +#ifndef N > > > > +#define N 800 > > > > +#endif > > > > +unsigned vect_a[N]; > > > > +unsigned vect_b[N]; > > > > + > > > > +unsigned test4(unsigned x) > > > > +{ > > > > + unsigned ret = 0; > > > > + for (int i = 0; i < N; i++) > > > > + { > > > > + vect_b[i] = x + i; > > > > + if (vect_a[i]*2 != x) > > > > + break; > > > > + vect_a[i] = x; > > > > + > > > > + } > > > > + return ret; > > > > +} > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > > b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > > > new file mode 100644 > > > > index > > > > 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1d > > > e2c4a04286989204fd > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > > > @@ -0,0 +1,31 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-add-options vect_early_break } */ > > > > +/* { dg-require-effective-target vect_early_break } */ > > > > +/* { dg-require-effective-target vect_int } */ > > > > + > > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > > > + > > > > +#ifndef N > > > > +#define N 800 > > > > +#endif > > > > +unsigned vect_a[N]; > > > > +unsigned vect_b[N]; > > > > + > > > > +unsigned test4(unsigned x) > > > > +{ > > > > + unsigned ret = 0; > > > > + for (int i = 0; i < N; i+=2) > > > > + { > > > > + vect_b[i] = x + i; > > > > + vect_b[i+1] = x + i+1; > > > > + if (vect_a[i]*2 != x) > > > > + break; > > > > + if (vect_a[i+1]*2 != x) > > > > + break; > > > > + vect_a[i] = x; > > > > + vect_a[i+1] = x; > > > > + > > > > + } > > > > + return ret; > > > > +} > > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > > > index > > > > 600987dd6e5d506aa5fbb02350f9dab77793d382..7e765df466a59249feb999c > > > 24d8f2dad232948ae 100644 > > > > --- a/gcc/tree-vect-slp.cc > > > > +++ b/gcc/tree-vect-slp.cc > > > > @@ -3697,6 +3697,13 @@ vect_build_slp_instance (vec_info *vinfo, > > > > "Analyzing vectorizable constructor: %G\n", > > > > root_stmt_infos[0]->stmt); > > > > } > > > > + else if (kind == slp_inst_kind_gcond) > > > > + { > > > > + if (dump_enabled_p ()) > > > > + dump_printf_loc (MSG_NOTE, vect_location, > > > > + "Analyzing vectorizable control flow: %G", > > > > + root_stmt_infos[0]->stmt); > > > > + } > > > > > > > > if (dump_enabled_p ()) > > > > { > > > > @@ -4143,6 +4150,12 @@ vect_analyze_slp_instance (vec_info *vinfo, > > > > STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) > > > > = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ())); > > > > } > > > > + else if (kind == slp_inst_kind_gcond) > > > > + { > > > > + /* Collect the stores and store them in scalar_stmts. */ > > > > + scalar_stmts.create (1); > > > > + scalar_stmts.quick_push (vect_stmt_to_vectorize (next_info)); > > > > + } > > > > > > We have this "left-over" dual API but since you never call > > > vect_analyze_slp_instance with slp_inst_kind_gcond but use > > > vect_build_slp_instance I think you can drop this hunk. > > > > > > > else > > > > gcc_unreachable (); > > > > > > > > @@ -4742,6 +4755,56 @@ vect_analyze_slp (vec_info *vinfo, unsigned > > > max_tree_size, > > > > bst_map, NULL, force_single_lane); > > > > } > > > > } > > > > + > > > > + /* Find SLP sequences starting from gconds. */ > > > > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > > > > + { > > > > + auto cond_info = loop_vinfo->lookup_stmt (cond); > > > > + vec<stmt_vec_info> stmts; > > > > + vec<stmt_vec_info> roots = vNULL; > > > > + vec<tree> remain = vNULL; > > > > + > > > > + cond_info = vect_stmt_to_vectorize (cond_info); > > > > + roots.safe_push (cond_info); > > > > + stmts.create (2); > > > > + tree args0 = gimple_cond_lhs (STMT_VINFO_STMT (cond_info)); > > > > + tree args1 = gimple_cond_rhs (STMT_VINFO_STMT (cond_info)); > > > > + /* An argument without a loop def will be codegened from vectorizing > > > the > > > > + root gcond itself. As such we don't need to try to build an SLP tree > > > > + from them. It's highly likely that the resulting SLP tree here if both > > > > + arguments have a def will be incompatible, but we rely on it being split > > > > + later on. */ > > > > + if (auto varg = loop_vinfo->lookup_def (args0)) > > > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > > > + > > > > + if (auto varg = loop_vinfo->lookup_def (args1)) > > > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > > > > > Hmm - you pattern-replace > > > > > > if (a != b) > > > > > > with > > > > > > patt = a != b; > > > if (patt != 0) > > > > > > correct? So I don't get why you push two scalar stmts in 'stmts' > > > and not just the pattern def of args0? Plus assert the condition > > > is NE_EXPR and the cond_rhs zero? (or simply guard discovery with > > > that - other gconds will simply not be vectorized) > > > > Yeah, the reason it can fail is that the gcond can be entirely loop invariant > > and the condition already a != 0, in which case gcond lowering would have > > done nothing. > > > > e.g. if (a != 0) where a is loop invariant. For instance test_memcmp_1_1 > > in /gcc.dg/memcmp-1.c is such loop. Technically we should be able to > > vectorize such loops, but while we can represent externals in the SLP tree, > > we can't start discovery at them, as no stmt_info for them. > > > > In principle all I need here is an empty SLP tree, since all codegen is driven > > by the roots for such invariant compares. However vect_build_slp_tree > > doesn't accept empty stmts. > > The externals would have SLP nodes of course but the requirement > currently is that the SLP instance root is an internal def. > > > I believe we are able to vectorize such loops today, so perhaps instead of > > failing we should support building an SLP instance with only roots? > > It might be tempting but I don't think this is generally useful. > > > In which case should I try to fit it into vect_build_slp_tree or just special > > case it for the gcond discovery? > > The issue is that you have two operands you technically would like to > see code-genrated - the 'a' and the '0' vector invariants, but the > SLP instance only has a single root. You could (as I suggested) > simply only build the SLP node for the (invariant) LHS of the gcond, > not by using vect_build_slp_tree but instead by manually building > the SLP tree for the invariant - see what vect_build_slp_tree_2 does Ah! That's what you meant with why am I building the SLP tree for a scalar. I didn't clearly understand what you meant there! Thanks! I can do so, It does mean that I have to then have a different approach for Invariants and loop defs, but that's simple. Will give it a try now. Thanks! Tamar > here: > > if (oprnd_info->first_dt == vect_external_def > || oprnd_info->first_dt == vect_constant_def) > { > if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ()) > { > tree op0; > tree uniform_val = op0 = oprnd_info->ops[0]; > for (j = 1; j < oprnd_info->ops.length (); ++j) > if (!operand_equal_p (uniform_val, oprnd_info->ops[j])) > { > uniform_val = NULL_TREE; > break; > } > if (!uniform_val > && !can_duplicate_and_interleave_p (vinfo, > > oprnd_info->ops.length (), > TREE_TYPE (op0))) > { > matches[j] = false; > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, > vect_location, > "Build SLP failed: invalid type of > def " > "for variable-length SLP %T\n", op0); > goto fail; > } > } > slp_tree invnode = vect_create_new_slp_node (oprnd_info->ops); > SLP_TREE_DEF_TYPE (invnode) = oprnd_info->first_dt; > oprnd_info->ops = vNULL; > children.safe_push (invnode); > continue; > > It should probably work fine then? > > Richard. > > > > > Thanks, > > Tamar > > > > > > That said, when loop_vinfo->lookup_def (args1) is not NULL > > > what you ask for doesn't really make sense - even if it discovers > > > OK (you get vectors with the condition lhs and rhs interleaved). > > > > > > > + if (!stmts.is_empty ()) > > > > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > > > > + stmts, roots, remain, > > > > + max_tree_size, &limit, > > > > + bst_map, NULL, force_single_lane); > > > > + } > > > > + > > > > + /* Find and create slp instances for inductions that have been forced > > > > + live due to early break. */ > > > > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > > > > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS > > > (loop_vinfo)) > > > > + { > > > > + vec<stmt_vec_info> stmts; > > > > + vec<stmt_vec_info> roots = vNULL; > > > > + vec<tree> remain = vNULL; > > > > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > > > > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); > > > > > > Hmm, so this isn't the LC PHI but the header PHI! So maybe better named > > > LOOP_VINFO_EARLY_BREAKS_LIVE_IVS? > > > > > > Note we're now walking all actual LC PHIs, but only when marked as > > > live. So maybe you can instead of adding to > > > LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS mark the defs appropriately? > > > Just a suggestion - I think what you do works as well. > > > > > > > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > > > > + stmts.create (1); > > > > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > > > > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > > > > + stmts, roots, remain, > > > > + max_tree_size, &limit, > > > > + bst_map, NULL, force_single_lane); > > > > + } > > > > } > > > > > > > > hash_set<slp_tree> visited_patterns; > > > > @@ -7157,8 +7220,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > > > > } > > > > } > > > > } > > > > - /* No def means this is a loo_vect sink. */ > > > > - if (!any_def) > > > > + /* No def means this is a loop_vect sink. Gimple conditionals also don't > have a > > > > + def but shouldn't be considered sinks. */ > > > > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > > > > { > > > > if (dump_enabled_p ()) > > > > dump_printf_loc (MSG_NOTE, vect_location, > > > > @@ -7542,9 +7606,27 @@ vect_slp_analyze_node_operations (vec_info > *vinfo, > > > slp_tree node, > > > > return true; > > > > visited_vec.safe_push (node); > > > > > > > > + /* If early break also check the root statement as we need to both analyze > > > > + and trigger codegen for it. The analysis will check whether can actually > > > > + vectorize it. At the memoment splitting off the analsysi bit from inside > > > > + it duplicates a lot of the setup code so it's not worth while to do so. > > > > + However when either the non-SLP loop vect goes away or we split > > > vectorizable_* > > > > + functions then we can call the analysis only part from here instead. */ > > > > bool res = true; > > > > - unsigned visited_rec_start = visited_vec.length (); > > > > unsigned cost_vec_rec_start = cost_vec->length (); > > > > + if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_gcond) > > > > + { > > > > + auto root_stmt_info = SLP_INSTANCE_ROOT_STMTS (node_instance)[0]; > > > > + res = vectorizable_early_exit (vinfo, root_stmt_info, NULL, NULL, NULL, > > > > + cost_vec); > > > > > > Can you do this here instead: > > > > > > bool > > > vect_slp_analyze_operations (vec_info *vinfo) > > > { > > > ... > > > /* Check we can vectorize the reduction. */ > > > || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc > > > && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) > > > > > > as we're checking the roots for slp_inst_kind_ctor and > > > slp_inst_kind_bb_reduc here already. > > > > > > > + if (!res) > > > > + { > > > > + cost_vec->truncate (cost_vec_rec_start); > > > > + return res; > > > > + } > > > > + } > > > > + > > > > + unsigned visited_rec_start = visited_vec.length (); > > > > bool seen_non_constant_child = false; > > > > FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) > > > > { > > > > @@ -8612,6 +8694,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > > > > !gsi_end_p (gsi); gsi_next (&gsi)) > > > > { > > > > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > > > > + /* This can be used to start SLP discovery for early breaks for BB early > breaks > > > > + when we get that far. */ > > > > if (!assign) > > > > continue; > > > > > > > > @@ -10758,7 +10842,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, > > > slp_tree node) > > > > /* Vectorize the instance root. */ > > > > > > > > void > > > > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > > > > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, > slp_instance > > > instance) > > > > { > > > > gassign *rstmt = NULL; > > > > > > > > @@ -10862,6 +10946,21 @@ vectorize_slp_instance_root_stmt (slp_tree > node, > > > slp_instance instance) > > > > update_stmt (gsi_stmt (rgsi)); > > > > return; > > > > } > > > > + else if (instance->kind == slp_inst_kind_gcond) > > > > + { > > > > + /* Only support a single root for now as we can't codegen CFG yet and so > we > > > > + can't support lane > 1 at this time. */ > > > > + gcc_assert (instance->root_stmts.length () == 1); > > > > + auto root_stmt_info = instance->root_stmts[0]; > > > > + auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt; > > > > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); > > > > > > So last_stmt is actually the gcond (and thus root_stmt_info)? Why > > > not use that for the rgsi directly? > > > > > > > + gimple *vec_stmt = NULL; > > > > + gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0); > > > > > > I guess you'd want to assert SLP_TREE_VEC_DEFS is not empty? > > > > > > > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > > > > + &vec_stmt, node, NULL); > > > > + gcc_assert (res); > > > > + return; > > > > + } > > > > else > > > > gcc_unreachable (); > > > > > > > > @@ -11080,7 +11179,7 @@ vect_schedule_slp (vec_info *vinfo, const > > > vec<slp_instance> &slp_instances) > > > > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > > > > > > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > > > > - vectorize_slp_instance_root_stmt (node, instance); > > > > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > > > > > > > if (dump_enabled_p ()) > > > > dump_printf_loc (MSG_NOTE, vect_location, > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > > > index > > > > b72b54d666879d8485f8d972b4e8d9dc64bc86b3..8f3f35989879199ffd0eb24 > > > 729cb7ade856a3c4d 100644 > > > > --- a/gcc/tree-vect-stmts.cc > > > > +++ b/gcc/tree-vect-stmts.cc > > > > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, > > > loop_vec_info loop_vinfo, > > > > dump_printf_loc (MSG_NOTE, vect_location, > > > > "vec_stmt_relevant_p: induction forced for " > > > > "early break.\n"); > > > > + LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo).safe_push > > > (stmt_info); > > > > *live_p = true; > > > > > > ah, so it's even live already - what's the relevancy used? See above > > > for the SLP discovery question regarding to that we already walk all > > > actual LC PHI defs. > > > > > > Otherwise looks OK - thanks for the work! > > > > > > Richard. > > > > > > > > > > > } > > > > @@ -12933,7 +12934,7 @@ vectorizable_comparison (vec_info *vinfo, > > > > /* Check to see if the current early break given in STMT_INFO is valid for > > > > vectorization. */ > > > > > > > > -static bool > > > > +bool > > > > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > > > > gimple_stmt_iterator *gsi, gimple **vec_stmt, > > > > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > > > > @@ -12958,7 +12959,7 @@ vectorizable_early_exit (vec_info *vinfo, > > > stmt_vec_info stmt_info, > > > > tree op0; > > > > enum vect_def_type dt0; > > > > if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, > &dt0, > > > > - &vectype)) > > > > + &vectype)) > > > > { > > > > if (dump_enabled_p ()) > > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > > @@ -12966,6 +12967,13 @@ vectorizable_early_exit (vec_info *vinfo, > > > stmt_vec_info stmt_info, > > > > return false; > > > > } > > > > > > > > + /* For SLP we don't want to use the type of the operands of the SLP node, > > > when > > > > + vectorizing using SLP slp_node will be the children of the gcond and we > want > > > to > > > > + use the type of the direct children which since the gcond is root will be the > > > > + current node, rather than a child node as vect_is_simple_use assumes. */ > > > > + if (slp_node) > > > > + vectype = SLP_TREE_VECTYPE (slp_node); > > > > + > > > > if (!vectype) > > > > return false; > > > > > > > > @@ -13060,9 +13068,18 @@ vectorizable_early_exit (vec_info *vinfo, > > > stmt_vec_info stmt_info, > > > > if (dump_enabled_p ()) > > > > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > > > > > > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > > > - vec_stmt, slp_node, cost_vec)) > > > > - gcc_unreachable (); > > > > + /* For SLP we don't do codegen of the body starting from the gcond, the > > > gconds are > > > > + roots and so by the time we get to them we have already codegened the > SLP > > > tree > > > > + and so we shouldn't try to do so again. The arguments have already been > > > > + vectorized. It's not very clean to do this here, But the masking code below > is > > > > + complex and this keeps it all in one place to ease fixes and backports. > Once > > > we > > > > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. > */ > > > > + if (!slp_node) > > > > + { > > > > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > > > + vec_stmt, slp_node, cost_vec)) > > > > + gcc_unreachable (); > > > > + } > > > > > > > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > > > > basic_block cond_bb = gimple_bb (stmt); > > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > > > index > > > > 490061aea2f6d465d9589eb97bbd34a920d76b1c..53483303c4ac3482760fe72 > > > 2354f602e0243e5a2 100644 > > > > --- a/gcc/tree-vectorizer.h > > > > +++ b/gcc/tree-vectorizer.h > > > > @@ -252,7 +252,8 @@ enum slp_instance_kind { > > > > slp_inst_kind_reduc_group, > > > > slp_inst_kind_reduc_chain, > > > > slp_inst_kind_bb_reduc, > > > > - slp_inst_kind_ctor > > > > + slp_inst_kind_ctor, > > > > + slp_inst_kind_gcond > > > > }; > > > > > > > > /* SLP instance is a sequence of stmts in a loop that can be packed into > > > > @@ -977,6 +978,10 @@ public: > > > > /* Statements whose VUSES need updating if early break vectorization is to > > > > happen. */ > > > > auto_vec<gimple*> early_break_vuses; > > > > + > > > > + /* Record statements that are needed to be live for early break > vectorization > > > > + but may not have an LC PHI node materialized yet in the exits. */ > > > > + auto_vec<stmt_vec_info> early_break_live_stmts; > > > > } *loop_vec_info; > > > > > > > > /* Access Functions. */ > > > > @@ -1036,6 +1041,8 @@ public: > > > > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > > > > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > > > > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > > > > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS(L) \ > > > > + (L)->early_break_live_stmts > > > > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > > > > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > > > > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > > > > @@ -2522,6 +2529,9 @@ extern bool vectorizable_phi (vec_info *, > > > stmt_vec_info, gimple **, slp_tree, > > > > stmt_vector_for_cost *); > > > > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > > > > gimple **, slp_tree, stmt_vector_for_cost *); > > > > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > > > > + gimple_stmt_iterator *, gimple **, > > > > + slp_tree, stmt_vector_for_cost *); > > > > extern bool vect_emulated_vector_p (tree); > > > > extern bool vect_can_vectorize_without_simd_p (tree_code); > > > > extern bool vect_can_vectorize_without_simd_p (code_helper); > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Richard Biener <rguenther@suse.de> > > > SUSE Software Solutions Germany GmbH, > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > Nuernberg) > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > e.g. if (a != 0) where a is loop invariant. For instance test_memcmp_1_1 > > in /gcc.dg/memcmp-1.c is such loop. Technically we should be able to > > vectorize such loops, but while we can represent externals in the SLP tree, > > we can't start discovery at them, as no stmt_info for them. > > > > In principle all I need here is an empty SLP tree, since all codegen is driven > > by the roots for such invariant compares. However vect_build_slp_tree > > doesn't accept empty stmts. > > The externals would have SLP nodes of course but the requirement > currently is that the SLP instance root is an internal def. > > > I believe we are able to vectorize such loops today, so perhaps instead of > > failing we should support building an SLP instance with only roots? > > It might be tempting but I don't think this is generally useful. > > > In which case should I try to fit it into vect_build_slp_tree or just special > > case it for the gcond discovery? > > The issue is that you have two operands you technically would like to > see code-genrated - the 'a' and the '0' vector invariants, but the > SLP instance only has a single root. You could (as I suggested) > simply only build the SLP node for the (invariant) LHS of the gcond, > not by using vect_build_slp_tree but instead by manually building > the SLP tree for the invariant - see what vect_build_slp_tree_2 does > here: > Done, Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Will test more targets closer to commit. Ok for master? gcc/ChangeLog: * tree-vect-loop.cc (vect_analyze_loop_2): Handle SLP trees with no children. * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. (LOOP_VINFO_EARLY_BREAKS_LIVE_IVS): New. (vectorizable_early_exit): Expose. (class _loop_vec_info): Add early_break_live_stmts. * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): Support gcond instances. (vect_analyze_slp): Analyze gcond roots and early break live statements. (maybe_push_to_hybrid_worklist): Don't sink gconds. (vect_slp_analyze_operations): Support gconds. (vect_slp_check_for_roots): Update comments. (vectorize_slp_instance_root_stmt): Support gconds. (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live statements. (vectorizable_early_exit): Support SLP. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-early-break_126.c: New test. * gcc.dg/vect/vect-early-break_127.c: New test. * gcc.dg/vect/vect-early-break_128.c: New test. -- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c new file mode 100644 index 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123ff509d13be17ffacf --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ + +#define N 1024 +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; + if (vect_a[i] > x) + { + ret *= vect_a[i]; + return vect_a[i]; + } + vect_a[i] = x; + ret += vect_a[i] + vect_b[i]; + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c new file mode 100644 index 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72c35df8e83535ef184ab --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; + if (vect_a[i]*2 != x) + break; + vect_a[i] = x; + + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c new file mode 100644 index 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1de2c4a04286989204fd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c @@ -0,0 +1,31 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i+=2) + { + vect_b[i] = x + i; + vect_b[i+1] = x + i+1; + if (vect_a[i]*2 != x) + break; + if (vect_a[i+1]*2 != x) + break; + vect_a[i] = x; + vect_a[i+1] = x; + + } + return ret; +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 9be50aaa621c6f31d95b626042e780a3bcf05cf3..9292449d6811cebc2d1161653727161521916186 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -3258,6 +3258,10 @@ again: FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), i, instance) { stmt_vec_info vinfo; + auto scalar_stmt = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance)); + if (scalar_stmt.is_empty ()) + continue; + vinfo = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance))[0]; if (! STMT_VINFO_GROUPED_ACCESS (vinfo)) continue; diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 9bb765e2cbacfa2a6e8872c5e00d32954b3398ba..12acde895f70a3b84bd1359e9e3373564c311f0f 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -3758,6 +3758,13 @@ vect_build_slp_instance (vec_info *vinfo, "Analyzing vectorizable constructor: %G\n", root_stmt_infos[0]->stmt); } + else if (kind == slp_inst_kind_gcond) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Analyzing vectorizable control flow: %G", + root_stmt_infos[0]->stmt); + } if (dump_enabled_p ()) { @@ -4824,6 +4831,80 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size, bst_map, NULL, force_single_lane); } } + + /* Find SLP sequences starting from gconds. */ + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) + { + auto cond_info = loop_vinfo->lookup_stmt (cond); + + cond_info = vect_stmt_to_vectorize (cond_info); + vec<stmt_vec_info> roots = vNULL; + roots.safe_push (cond_info); + gimple *stmt = STMT_VINFO_STMT (cond_info); + tree args0 = gimple_cond_lhs (stmt); + tree args1 = gimple_cond_rhs (stmt); + + /* These should be enforced by cond lowering. */ + gcc_assert (gimple_cond_code (stmt) == NE_EXPR); + gcc_assert (zerop (args1)); + + /* An argument without a loop def will be codegened from vectorizing the + root gcond itself. As such we don't need to try to build an SLP tree + from them. It's highly likely that the resulting SLP tree here if both + arguments have a def will be incompatible, but we rely on it being split + later on. */ + if (auto varg = loop_vinfo->lookup_def (args0)) + { + vec<stmt_vec_info> stmts; + vec<tree> remain = vNULL; + stmts.create (1); + stmts.quick_push (vect_stmt_to_vectorize (varg)); + + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, + stmts, roots, remain, + max_tree_size, &limit, + bst_map, NULL, force_single_lane); + } + else + { + /* Create a new SLP instance. */ + slp_instance new_instance = XNEW (class _slp_instance); + vec<tree> ops; + ops.create (1); + ops.quick_push (args0); + slp_tree invnode = vect_create_new_slp_node (ops); + SLP_TREE_DEF_TYPE (invnode) = vect_external_def; + SLP_INSTANCE_TREE (new_instance) = invnode; + SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = 1; + SLP_INSTANCE_LOADS (new_instance) = vNULL; + SLP_INSTANCE_ROOT_STMTS (new_instance) = roots; + SLP_INSTANCE_REMAIN_DEFS (new_instance) = vNULL; + SLP_INSTANCE_KIND (new_instance) = slp_inst_kind_gcond; + new_instance->reduc_phis = NULL; + new_instance->cost_vec = vNULL; + new_instance->subgraph_entries = vNULL; + vinfo->slp_instances.safe_push (new_instance); + } + } + + /* Find and create slp instances for inductions that have been forced + live due to early break. */ + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo)) + { + vec<stmt_vec_info> stmts; + vec<stmt_vec_info> roots = vNULL; + vec<tree> remain = vNULL; + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); + stmts.create (1); + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, + stmts, roots, remain, + max_tree_size, &limit, + bst_map, NULL, force_single_lane); + } } hash_set<slp_tree> visited_patterns; @@ -7239,8 +7320,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, } } } - /* No def means this is a loo_vect sink. */ - if (!any_def) + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a + def but shouldn't be considered sinks. */ + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) { if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -8064,7 +8146,14 @@ vect_slp_analyze_operations (vec_info *vinfo) (SLP_INSTANCE_TREE (instance)))))) /* Check we can vectorize the reduction. */ || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc - && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) + && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)) + /* Check we can vectorize the gcond. */ + || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_gcond + && !vectorizable_early_exit (vinfo, + SLP_INSTANCE_ROOT_STMTS (instance)[0], + NULL, NULL, + SLP_INSTANCE_TREE (instance), + &cost_vec))) { cost_vec.release (); slp_tree node = SLP_INSTANCE_TREE (instance); @@ -8694,6 +8783,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) !gsi_end_p (gsi); gsi_next (&gsi)) { gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); + /* This can be used to start SLP discovery for early breaks for BB early breaks + when we get that far. */ if (!assign) continue; @@ -10912,7 +11003,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, slp_tree node) /* Vectorize the instance root. */ void -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance instance) { gassign *rstmt = NULL; @@ -11016,6 +11107,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) update_stmt (gsi_stmt (rgsi)); return; } + else if (instance->kind == slp_inst_kind_gcond) + { + /* Only support a single root for now as we can't codegen CFG yet and so we + can't support lane > 1 at this time. */ + gcc_assert (instance->root_stmts.length () == 1); + auto root_stmt_info = instance->root_stmts[0]; + auto last_stmt = STMT_VINFO_STMT (root_stmt_info); + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); + gimple *vec_stmt = NULL; + gcc_assert (!SLP_TREE_VEC_DEFS (node).is_empty ()); + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, + &vec_stmt, node, NULL); + gcc_assert (res); + return; + } else gcc_unreachable (); @@ -11234,7 +11340,7 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances) vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) - vectorize_slp_instance_root_stmt (node, instance); + vectorize_slp_instance_root_stmt (vinfo, node, instance); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 43358767934612a9788c587f35f66e007faa8863..bdfc40369a12fb8a15ee9613324c10ad51b31583 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo, dump_printf_loc (MSG_NOTE, vect_location, "vec_stmt_relevant_p: induction forced for " "early break.\n"); + LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info); *live_p = true; } @@ -13004,7 +13005,7 @@ vectorizable_comparison (vec_info *vinfo, /* Check to see if the current early break given in STMT_INFO is valid for vectorization. */ -static bool +bool vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, gimple_stmt_iterator *gsi, gimple **vec_stmt, slp_tree slp_node, stmt_vector_for_cost *cost_vec) @@ -13028,8 +13029,13 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, slp_tree slp_op0; tree op0; enum vect_def_type dt0; - if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, - &vectype)) + + /* Early break gcond kind SLP trees can be root only and have no children, + for instance in the case where the argument is an external. If that's + the case there is no operand to analyse use of. */ + if ((!slp_node || !SLP_TREE_CHILDREN (slp_node).is_empty ()) + && !vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, + &vectype)) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -13037,16 +13043,30 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, return false; } + /* For SLP we don't want to use the type of the operands of the SLP node, when + vectorizing using SLP slp_node will be the children of the gcond and we + want to use the type of the direct children which since the gcond is root + will be the current node, rather than a child node as vect_is_simple_use + assumes. */ + if (slp_node) + vectype = SLP_TREE_VECTYPE (slp_node); + if (!vectype) return false; machine_mode mode = TYPE_MODE (vectype); - int ncopies; + int ncopies, vec_num; if (slp_node) - ncopies = 1; + { + ncopies = 1; + vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + } else - ncopies = vect_get_num_copies (loop_vinfo, vectype); + { + ncopies = vect_get_num_copies (loop_vinfo, vectype); + vec_num = 1; + } vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); @@ -13115,9 +13135,11 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, { if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype, OPTIMIZE_FOR_SPEED)) - vect_record_loop_len (loop_vinfo, lens, ncopies, vectype, 1); + vect_record_loop_len (loop_vinfo, lens, ncopies * vec_num, + vectype, 1); else - vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL); + vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num, + vectype, NULL); } return true; @@ -13131,9 +13153,18 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, - vec_stmt, slp_node, cost_vec)) - gcc_unreachable (); + /* For SLP we don't do codegen of the body starting from the gcond, the gconds are + roots and so by the time we get to them we have already codegened the SLP tree + and so we shouldn't try to do so again. The arguments have already been + vectorized. It's not very clean to do this here, But the masking code below is + complex and this keeps it all in one place to ease fixes and backports. Once we + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ + if (!slp_node) + { + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, + vec_stmt, slp_node, cost_vec)) + gcc_unreachable (); + } gimple *stmt = STMT_VINFO_STMT (stmt_info); basic_block cond_bb = gimple_bb (stmt); @@ -13165,8 +13196,8 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, for (unsigned i = 0; i < stmts.length (); i++) { tree stmt_mask - = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, vectype, - i); + = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies * vec_num, + vectype, i); stmt_mask = prepare_vec_mask (loop_vinfo, TREE_TYPE (stmt_mask), stmt_mask, stmts[i], &cond_gsi); @@ -13176,8 +13207,8 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, for (unsigned i = 0; i < stmts.length (); i++) { tree len_mask = vect_gen_loop_len_mask (loop_vinfo, gsi, &cond_gsi, - lens, ncopies, vectype, - stmts[i], i, 1); + lens, ncopies * vec_num, + vectype, stmts[i], i, 1); workset.quick_push (len_mask); } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 73bccb5a40a8be4d3f5f600239d63c756d2e35e7..5028dd154d1c114c6ca3a4c6bf068cffef3aaa6c 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -252,7 +252,8 @@ enum slp_instance_kind { slp_inst_kind_reduc_group, slp_inst_kind_reduc_chain, slp_inst_kind_bb_reduc, - slp_inst_kind_ctor + slp_inst_kind_ctor, + slp_inst_kind_gcond }; /* SLP instance is a sequence of stmts in a loop that can be packed into @@ -977,6 +978,10 @@ public: /* Statements whose VUSES need updating if early break vectorization is to happen. */ auto_vec<gimple*> early_break_vuses; + + /* Record statements that are needed to be live for early break vectorization + but may not have an LC PHI node materialized yet in the exits. */ + auto_vec<stmt_vec_info> early_break_live_ivs; } *loop_vec_info; /* Access Functions. */ @@ -1036,6 +1041,8 @@ public: #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) +#define LOOP_VINFO_EARLY_BREAKS_LIVE_IVS(L) \ + (L)->early_break_live_ivs #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds @@ -2524,6 +2531,9 @@ extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree, stmt_vector_for_cost *); extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, gimple **, slp_tree, stmt_vector_for_cost *); +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, + gimple_stmt_iterator *, gimple **, + slp_tree, stmt_vector_for_cost *); extern bool vect_emulated_vector_p (tree); extern bool vect_can_vectorize_without_simd_p (tree_code); extern bool vect_can_vectorize_without_simd_p (code_helper);
On Thu, 10 Oct 2024, Tamar Christina wrote: > > > e.g. if (a != 0) where a is loop invariant. For instance test_memcmp_1_1 > > > in /gcc.dg/memcmp-1.c is such loop. Technically we should be able to > > > vectorize such loops, but while we can represent externals in the SLP tree, > > > we can't start discovery at them, as no stmt_info for them. > > > > > > In principle all I need here is an empty SLP tree, since all codegen is driven > > > by the roots for such invariant compares. However vect_build_slp_tree > > > doesn't accept empty stmts. > > > > The externals would have SLP nodes of course but the requirement > > currently is that the SLP instance root is an internal def. > > > > > I believe we are able to vectorize such loops today, so perhaps instead of > > > failing we should support building an SLP instance with only roots? > > > > It might be tempting but I don't think this is generally useful. > > > > > In which case should I try to fit it into vect_build_slp_tree or just special > > > case it for the gcond discovery? > > > > The issue is that you have two operands you technically would like to > > see code-genrated - the 'a' and the '0' vector invariants, but the > > SLP instance only has a single root. You could (as I suggested) > > simply only build the SLP node for the (invariant) LHS of the gcond, > > not by using vect_build_slp_tree but instead by manually building > > the SLP tree for the invariant - see what vect_build_slp_tree_2 does > > here: > > > > Done, > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > Will test more targets closer to commit. > > Ok for master? > > gcc/ChangeLog: > > * tree-vect-loop.cc (vect_analyze_loop_2): Handle SLP trees with no > children. > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > (LOOP_VINFO_EARLY_BREAKS_LIVE_IVS): New. > (vectorizable_early_exit): Expose. > (class _loop_vec_info): Add early_break_live_stmts. > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > Support gcond instances. > (vect_analyze_slp): Analyze gcond roots and early break live statements. > (maybe_push_to_hybrid_worklist): Don't sink gconds. > (vect_slp_analyze_operations): Support gconds. > (vect_slp_check_for_roots): Update comments. > (vectorize_slp_instance_root_stmt): Support gconds. > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > statements. > (vectorizable_early_exit): Support SLP. > > gcc/testsuite/ChangeLog: > > * gcc.dg/vect/vect-early-break_126.c: New test. > * gcc.dg/vect/vect-early-break_127.c: New test. > * gcc.dg/vect/vect-early-break_128.c: New test. > > -- inline copy of patch -- > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > new file mode 100644 > index 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123ff509d13be17ffacf > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > @@ -0,0 +1,28 @@ > +/* { dg-do compile } */ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break } */ > +/* { dg-require-effective-target vect_int } */ > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > + > +#define N 1024 > +unsigned vect_a[N]; > +unsigned vect_b[N]; > + > +unsigned test4(unsigned x) > +{ > + unsigned ret = 0; > + for (int i = 0; i < N; i++) > + { > + vect_b[i] = x + i; > + if (vect_a[i] > x) > + { > + ret *= vect_a[i]; > + return vect_a[i]; > + } > + vect_a[i] = x; > + ret += vect_a[i] + vect_b[i]; > + } > + return ret; > +} > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > new file mode 100644 > index 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72c35df8e83535ef184ab > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > @@ -0,0 +1,27 @@ > +/* { dg-do compile } */ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break } */ > +/* { dg-require-effective-target vect_int } */ > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > + > +#ifndef N > +#define N 800 > +#endif > +unsigned vect_a[N]; > +unsigned vect_b[N]; > + > +unsigned test4(unsigned x) > +{ > + unsigned ret = 0; > + for (int i = 0; i < N; i++) > + { > + vect_b[i] = x + i; > + if (vect_a[i]*2 != x) > + break; > + vect_a[i] = x; > + > + } > + return ret; > +} > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > new file mode 100644 > index 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1de2c4a04286989204fd > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > @@ -0,0 +1,31 @@ > +/* { dg-do compile } */ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break } */ > +/* { dg-require-effective-target vect_int } */ > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > + > +#ifndef N > +#define N 800 > +#endif > +unsigned vect_a[N]; > +unsigned vect_b[N]; > + > +unsigned test4(unsigned x) > +{ > + unsigned ret = 0; > + for (int i = 0; i < N; i+=2) > + { > + vect_b[i] = x + i; > + vect_b[i+1] = x + i+1; > + if (vect_a[i]*2 != x) > + break; > + if (vect_a[i+1]*2 != x) > + break; > + vect_a[i] = x; > + vect_a[i+1] = x; > + > + } > + return ret; > +} > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 9be50aaa621c6f31d95b626042e780a3bcf05cf3..9292449d6811cebc2d1161653727161521916186 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -3258,6 +3258,10 @@ again: > FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), i, instance) > { > stmt_vec_info vinfo; > + auto scalar_stmt = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance)); > + if (scalar_stmt.is_empty ()) > + continue; > + Ah, this whole code blob will probably go away. I'd prefer if (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance)) != vect_internal_def) continue; > vinfo = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance))[0]; > if (! STMT_VINFO_GROUPED_ACCESS (vinfo)) > continue; > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 9bb765e2cbacfa2a6e8872c5e00d32954b3398ba..12acde895f70a3b84bd1359e9e3373564c311f0f 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -3758,6 +3758,13 @@ vect_build_slp_instance (vec_info *vinfo, > "Analyzing vectorizable constructor: %G\n", > root_stmt_infos[0]->stmt); > } > + else if (kind == slp_inst_kind_gcond) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Analyzing vectorizable control flow: %G", > + root_stmt_infos[0]->stmt); > + } > > if (dump_enabled_p ()) > { > @@ -4824,6 +4831,80 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size, > bst_map, NULL, force_single_lane); > } > } > + > + /* Find SLP sequences starting from gconds. */ > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > + { > + auto cond_info = loop_vinfo->lookup_stmt (cond); > + > + cond_info = vect_stmt_to_vectorize (cond_info); > + vec<stmt_vec_info> roots = vNULL; > + roots.safe_push (cond_info); > + gimple *stmt = STMT_VINFO_STMT (cond_info); > + tree args0 = gimple_cond_lhs (stmt); > + tree args1 = gimple_cond_rhs (stmt); > + > + /* These should be enforced by cond lowering. */ > + gcc_assert (gimple_cond_code (stmt) == NE_EXPR); > + gcc_assert (zerop (args1)); > + > + /* An argument without a loop def will be codegened from vectorizing the > + root gcond itself. As such we don't need to try to build an SLP tree > + from them. It's highly likely that the resulting SLP tree here if both > + arguments have a def will be incompatible, but we rely on it being split > + later on. */ > + if (auto varg = loop_vinfo->lookup_def (args0)) > + { > + vec<stmt_vec_info> stmts; > + vec<tree> remain = vNULL; > + stmts.create (1); > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > + > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > + stmts, roots, remain, > + max_tree_size, &limit, > + bst_map, NULL, force_single_lane); > + } > + else > + { > + /* Create a new SLP instance. */ > + slp_instance new_instance = XNEW (class _slp_instance); > + vec<tree> ops; > + ops.create (1); > + ops.quick_push (args0); > + slp_tree invnode = vect_create_new_slp_node (ops); > + SLP_TREE_DEF_TYPE (invnode) = vect_external_def; > + SLP_INSTANCE_TREE (new_instance) = invnode; > + SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = 1; > + SLP_INSTANCE_LOADS (new_instance) = vNULL; > + SLP_INSTANCE_ROOT_STMTS (new_instance) = roots; > + SLP_INSTANCE_REMAIN_DEFS (new_instance) = vNULL; > + SLP_INSTANCE_KIND (new_instance) = slp_inst_kind_gcond; > + new_instance->reduc_phis = NULL; > + new_instance->cost_vec = vNULL; > + new_instance->subgraph_entries = vNULL; > + vinfo->slp_instances.safe_push (new_instance); > + } > + } > + > + /* Find and create slp instances for inductions that have been forced > + live due to early break. */ > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo)) > + { > + vec<stmt_vec_info> stmts; > + vec<stmt_vec_info> roots = vNULL; > + vec<tree> remain = vNULL; > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > + stmts.create (1); > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > + stmts, roots, remain, > + max_tree_size, &limit, > + bst_map, NULL, force_single_lane); > + } > } > > hash_set<slp_tree> visited_patterns; > @@ -7239,8 +7320,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > } > } > } > - /* No def means this is a loo_vect sink. */ > - if (!any_def) > + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a > + def but shouldn't be considered sinks. */ > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > @@ -8064,7 +8146,14 @@ vect_slp_analyze_operations (vec_info *vinfo) > (SLP_INSTANCE_TREE (instance)))))) > /* Check we can vectorize the reduction. */ > || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc > - && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) > + && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)) > + /* Check we can vectorize the gcond. */ > + || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_gcond > + && !vectorizable_early_exit (vinfo, > + SLP_INSTANCE_ROOT_STMTS (instance)[0], > + NULL, NULL, > + SLP_INSTANCE_TREE (instance), > + &cost_vec))) > { > cost_vec.release (); > slp_tree node = SLP_INSTANCE_TREE (instance); > @@ -8694,6 +8783,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > !gsi_end_p (gsi); gsi_next (&gsi)) > { > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > + /* This can be used to start SLP discovery for early breaks for BB early breaks > + when we get that far. */ > if (!assign) > continue; > > @@ -10912,7 +11003,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, slp_tree node) > /* Vectorize the instance root. */ > > void > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance instance) > { > gassign *rstmt = NULL; > > @@ -11016,6 +11107,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > update_stmt (gsi_stmt (rgsi)); > return; > } > + else if (instance->kind == slp_inst_kind_gcond) > + { > + /* Only support a single root for now as we can't codegen CFG yet and so we > + can't support lane > 1 at this time. */ > + gcc_assert (instance->root_stmts.length () == 1); > + auto root_stmt_info = instance->root_stmts[0]; > + auto last_stmt = STMT_VINFO_STMT (root_stmt_info); > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); > + gimple *vec_stmt = NULL; > + gcc_assert (!SLP_TREE_VEC_DEFS (node).is_empty ()); > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > + &vec_stmt, node, NULL); > + gcc_assert (res); > + return; > + } > else > gcc_unreachable (); > > @@ -11234,7 +11340,7 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances) > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > - vectorize_slp_instance_root_stmt (node, instance); > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index 43358767934612a9788c587f35f66e007faa8863..bdfc40369a12fb8a15ee9613324c10ad51b31583 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo, > dump_printf_loc (MSG_NOTE, vect_location, > "vec_stmt_relevant_p: induction forced for " > "early break.\n"); > + LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info); > *live_p = true; > > } > @@ -13004,7 +13005,7 @@ vectorizable_comparison (vec_info *vinfo, > /* Check to see if the current early break given in STMT_INFO is valid for > vectorization. */ > > -static bool > +bool > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > gimple_stmt_iterator *gsi, gimple **vec_stmt, > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > @@ -13028,8 +13029,13 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > slp_tree slp_op0; > tree op0; > enum vect_def_type dt0; > - if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > - &vectype)) > + > + /* Early break gcond kind SLP trees can be root only and have no children, > + for instance in the case where the argument is an external. If that's > + the case there is no operand to analyse use of. */ > + if ((!slp_node || !SLP_TREE_CHILDREN (slp_node).is_empty ()) > + && !vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > + &vectype)) Hmm, but doesn't this somehow show we're mixing things up here? stmt_info seems to be the gcond but slp_node is for the pattern condition (or alternatively the invariant condition). But yeah, I guess that's how we set this up. Unless you want to create the pattern also for invariant conditions (why not?). I think the patch is good to go with the minor adjustment in the first hunk. Thanks, Richard. > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -13037,16 +13043,30 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > return false; > } > > + /* For SLP we don't want to use the type of the operands of the SLP node, when > + vectorizing using SLP slp_node will be the children of the gcond and we > + want to use the type of the direct children which since the gcond is root > + will be the current node, rather than a child node as vect_is_simple_use > + assumes. */ > + if (slp_node) > + vectype = SLP_TREE_VECTYPE (slp_node); > + > if (!vectype) > return false; > > machine_mode mode = TYPE_MODE (vectype); > - int ncopies; > + int ncopies, vec_num; > > if (slp_node) > - ncopies = 1; > + { > + ncopies = 1; > + vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > + } > else > - ncopies = vect_get_num_copies (loop_vinfo, vectype); > + { > + ncopies = vect_get_num_copies (loop_vinfo, vectype); > + vec_num = 1; > + } > > vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > @@ -13115,9 +13135,11 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > { > if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype, > OPTIMIZE_FOR_SPEED)) > - vect_record_loop_len (loop_vinfo, lens, ncopies, vectype, 1); > + vect_record_loop_len (loop_vinfo, lens, ncopies * vec_num, > + vectype, 1); > else > - vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL); > + vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num, > + vectype, NULL); > } > > return true; > @@ -13131,9 +13153,18 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > - vec_stmt, slp_node, cost_vec)) > - gcc_unreachable (); > + /* For SLP we don't do codegen of the body starting from the gcond, the gconds are > + roots and so by the time we get to them we have already codegened the SLP tree > + and so we shouldn't try to do so again. The arguments have already been > + vectorized. It's not very clean to do this here, But the masking code below is > + complex and this keeps it all in one place to ease fixes and backports. Once we > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ > + if (!slp_node) > + { > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > + vec_stmt, slp_node, cost_vec)) > + gcc_unreachable (); > + } > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > basic_block cond_bb = gimple_bb (stmt); > @@ -13165,8 +13196,8 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > for (unsigned i = 0; i < stmts.length (); i++) > { > tree stmt_mask > - = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, vectype, > - i); > + = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies * vec_num, > + vectype, i); > stmt_mask > = prepare_vec_mask (loop_vinfo, TREE_TYPE (stmt_mask), stmt_mask, > stmts[i], &cond_gsi); > @@ -13176,8 +13207,8 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > for (unsigned i = 0; i < stmts.length (); i++) > { > tree len_mask = vect_gen_loop_len_mask (loop_vinfo, gsi, &cond_gsi, > - lens, ncopies, vectype, > - stmts[i], i, 1); > + lens, ncopies * vec_num, > + vectype, stmts[i], i, 1); > > workset.quick_push (len_mask); > } > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index 73bccb5a40a8be4d3f5f600239d63c756d2e35e7..5028dd154d1c114c6ca3a4c6bf068cffef3aaa6c 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -252,7 +252,8 @@ enum slp_instance_kind { > slp_inst_kind_reduc_group, > slp_inst_kind_reduc_chain, > slp_inst_kind_bb_reduc, > - slp_inst_kind_ctor > + slp_inst_kind_ctor, > + slp_inst_kind_gcond > }; > > /* SLP instance is a sequence of stmts in a loop that can be packed into > @@ -977,6 +978,10 @@ public: > /* Statements whose VUSES need updating if early break vectorization is to > happen. */ > auto_vec<gimple*> early_break_vuses; > + > + /* Record statements that are needed to be live for early break vectorization > + but may not have an LC PHI node materialized yet in the exits. */ > + auto_vec<stmt_vec_info> early_break_live_ivs; > } *loop_vec_info; > > /* Access Functions. */ > @@ -1036,6 +1041,8 @@ public: > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_IVS(L) \ > + (L)->early_break_live_ivs > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > @@ -2524,6 +2531,9 @@ extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree, > stmt_vector_for_cost *); > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > gimple **, slp_tree, stmt_vector_for_cost *); > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > + gimple_stmt_iterator *, gimple **, > + slp_tree, stmt_vector_for_cost *); > extern bool vect_emulated_vector_p (tree); > extern bool vect_can_vectorize_without_simd_p (tree_code); > extern bool vect_can_vectorize_without_simd_p (code_helper); >
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Friday, October 11, 2024 8:11 AM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com > Subject: RE: [PATCH]middle-end: support SLP early break > > On Thu, 10 Oct 2024, Tamar Christina wrote: > > > > > e.g. if (a != 0) where a is loop invariant. For instance test_memcmp_1_1 > > > > in /gcc.dg/memcmp-1.c is such loop. Technically we should be able to > > > > vectorize such loops, but while we can represent externals in the SLP tree, > > > > we can't start discovery at them, as no stmt_info for them. > > > > > > > > In principle all I need here is an empty SLP tree, since all codegen is driven > > > > by the roots for such invariant compares. However vect_build_slp_tree > > > > doesn't accept empty stmts. > > > > > > The externals would have SLP nodes of course but the requirement > > > currently is that the SLP instance root is an internal def. > > > > > > > I believe we are able to vectorize such loops today, so perhaps instead of > > > > failing we should support building an SLP instance with only roots? > > > > > > It might be tempting but I don't think this is generally useful. > > > > > > > In which case should I try to fit it into vect_build_slp_tree or just special > > > > case it for the gcond discovery? > > > > > > The issue is that you have two operands you technically would like to > > > see code-genrated - the 'a' and the '0' vector invariants, but the > > > SLP instance only has a single root. You could (as I suggested) > > > simply only build the SLP node for the (invariant) LHS of the gcond, > > > not by using vect_build_slp_tree but instead by manually building > > > the SLP tree for the invariant - see what vect_build_slp_tree_2 does > > > here: > > > > > > > Done, > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Will test more targets closer to commit. > > > > Ok for master? > > > > gcc/ChangeLog: > > > > * tree-vect-loop.cc (vect_analyze_loop_2): Handle SLP trees with no > > children. > > * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_gcond. > > (LOOP_VINFO_EARLY_BREAKS_LIVE_IVS): New. > > (vectorizable_early_exit): Expose. > > (class _loop_vec_info): Add early_break_live_stmts. > > * tree-vect-slp.cc (vect_build_slp_instance, vect_analyze_slp_instance): > > Support gcond instances. > > (vect_analyze_slp): Analyze gcond roots and early break live statements. > > (maybe_push_to_hybrid_worklist): Don't sink gconds. > > (vect_slp_analyze_operations): Support gconds. > > (vect_slp_check_for_roots): Update comments. > > (vectorize_slp_instance_root_stmt): Support gconds. > > (vect_schedule_slp): Pass vinfo to vectorize_slp_instance_root_stmt. > > * tree-vect-stmts.cc (vect_stmt_relevant_p): Record early break live > > statements. > > (vectorizable_early_exit): Support SLP. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/vect/vect-early-break_126.c: New test. > > * gcc.dg/vect/vect-early-break_127.c: New test. > > * gcc.dg/vect/vect-early-break_128.c: New test. > > > > -- inline copy of patch -- > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123 > ff509d13be17ffacf > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c > > @@ -0,0 +1,28 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#define N 1024 > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i++) > > + { > > + vect_b[i] = x + i; > > + if (vect_a[i] > x) > > + { > > + ret *= vect_a[i]; > > + return vect_a[i]; > > + } > > + vect_a[i] = x; > > + ret += vect_a[i] + vect_b[i]; > > + } > > + return ret; > > +} > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72 > c35df8e83535ef184ab > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c > > @@ -0,0 +1,27 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#ifndef N > > +#define N 800 > > +#endif > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i++) > > + { > > + vect_b[i] = x + i; > > + if (vect_a[i]*2 != x) > > + break; > > + vect_a[i] = x; > > + > > + } > > + return ret; > > +} > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > new file mode 100644 > > index > 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1d > e2c4a04286989204fd > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c > > @@ -0,0 +1,31 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ > > + > > +#ifndef N > > +#define N 800 > > +#endif > > +unsigned vect_a[N]; > > +unsigned vect_b[N]; > > + > > +unsigned test4(unsigned x) > > +{ > > + unsigned ret = 0; > > + for (int i = 0; i < N; i+=2) > > + { > > + vect_b[i] = x + i; > > + vect_b[i+1] = x + i+1; > > + if (vect_a[i]*2 != x) > > + break; > > + if (vect_a[i+1]*2 != x) > > + break; > > + vect_a[i] = x; > > + vect_a[i+1] = x; > > + > > + } > > + return ret; > > +} > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index > 9be50aaa621c6f31d95b626042e780a3bcf05cf3..9292449d6811cebc2d116165 > 3727161521916186 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -3258,6 +3258,10 @@ again: > > FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), i, instance) > > { > > stmt_vec_info vinfo; > > + auto scalar_stmt = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE > (instance)); > > + if (scalar_stmt.is_empty ()) > > + continue; > > + > > Ah, this whole code blob will probably go away. I'd prefer > > if (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance)) != > vect_internal_def) > continue; > > > vinfo = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance))[0]; > > if (! STMT_VINFO_GROUPED_ACCESS (vinfo)) > > continue; > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index > 9bb765e2cbacfa2a6e8872c5e00d32954b3398ba..12acde895f70a3b84bd1359e > 9e3373564c311f0f 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -3758,6 +3758,13 @@ vect_build_slp_instance (vec_info *vinfo, > > "Analyzing vectorizable constructor: %G\n", > > root_stmt_infos[0]->stmt); > > } > > + else if (kind == slp_inst_kind_gcond) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "Analyzing vectorizable control flow: %G", > > + root_stmt_infos[0]->stmt); > > + } > > > > if (dump_enabled_p ()) > > { > > @@ -4824,6 +4831,80 @@ vect_analyze_slp (vec_info *vinfo, unsigned > max_tree_size, > > bst_map, NULL, force_single_lane); > > } > > } > > + > > + /* Find SLP sequences starting from gconds. */ > > + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) > > + { > > + auto cond_info = loop_vinfo->lookup_stmt (cond); > > + > > + cond_info = vect_stmt_to_vectorize (cond_info); > > + vec<stmt_vec_info> roots = vNULL; > > + roots.safe_push (cond_info); > > + gimple *stmt = STMT_VINFO_STMT (cond_info); > > + tree args0 = gimple_cond_lhs (stmt); > > + tree args1 = gimple_cond_rhs (stmt); > > + > > + /* These should be enforced by cond lowering. */ > > + gcc_assert (gimple_cond_code (stmt) == NE_EXPR); > > + gcc_assert (zerop (args1)); > > + > > + /* An argument without a loop def will be codegened from vectorizing > the > > + root gcond itself. As such we don't need to try to build an SLP tree > > + from them. It's highly likely that the resulting SLP tree here if both > > + arguments have a def will be incompatible, but we rely on it being split > > + later on. */ > > + if (auto varg = loop_vinfo->lookup_def (args0)) > > + { > > + vec<stmt_vec_info> stmts; > > + vec<tree> remain = vNULL; > > + stmts.create (1); > > + stmts.quick_push (vect_stmt_to_vectorize (varg)); > > + > > + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, > > + stmts, roots, remain, > > + max_tree_size, &limit, > > + bst_map, NULL, force_single_lane); > > + } > > + else > > + { > > + /* Create a new SLP instance. */ > > + slp_instance new_instance = XNEW (class _slp_instance); > > + vec<tree> ops; > > + ops.create (1); > > + ops.quick_push (args0); > > + slp_tree invnode = vect_create_new_slp_node (ops); > > + SLP_TREE_DEF_TYPE (invnode) = vect_external_def; > > + SLP_INSTANCE_TREE (new_instance) = invnode; > > + SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = 1; > > + SLP_INSTANCE_LOADS (new_instance) = vNULL; > > + SLP_INSTANCE_ROOT_STMTS (new_instance) = roots; > > + SLP_INSTANCE_REMAIN_DEFS (new_instance) = vNULL; > > + SLP_INSTANCE_KIND (new_instance) = slp_inst_kind_gcond; > > + new_instance->reduc_phis = NULL; > > + new_instance->cost_vec = vNULL; > > + new_instance->subgraph_entries = vNULL; > > + vinfo->slp_instances.safe_push (new_instance); > > + } > > + } > > + > > + /* Find and create slp instances for inductions that have been forced > > + live due to early break. */ > > + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); > > + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo)) > > + { > > + vec<stmt_vec_info> stmts; > > + vec<stmt_vec_info> roots = vNULL; > > + vec<tree> remain = vNULL; > > + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); > > + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); > > + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); > > + stmts.create (1); > > + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); > > + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, > > + stmts, roots, remain, > > + max_tree_size, &limit, > > + bst_map, NULL, force_single_lane); > > + } > > } > > > > hash_set<slp_tree> visited_patterns; > > @@ -7239,8 +7320,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, > > } > > } > > } > > - /* No def means this is a loo_vect sink. */ > > - if (!any_def) > > + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a > > + def but shouldn't be considered sinks. */ > > + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > @@ -8064,7 +8146,14 @@ vect_slp_analyze_operations (vec_info *vinfo) > > (SLP_INSTANCE_TREE (instance)))))) > > /* Check we can vectorize the reduction. */ > > || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_bb_reduc > > - && !vectorizable_bb_reduc_epilogue (instance, &cost_vec))) > > + && !vectorizable_bb_reduc_epilogue (instance, &cost_vec)) > > + /* Check we can vectorize the gcond. */ > > + || (SLP_INSTANCE_KIND (instance) == slp_inst_kind_gcond > > + && !vectorizable_early_exit (vinfo, > > + SLP_INSTANCE_ROOT_STMTS > (instance)[0], > > + NULL, NULL, > > + SLP_INSTANCE_TREE (instance), > > + &cost_vec))) > > { > > cost_vec.release (); > > slp_tree node = SLP_INSTANCE_TREE (instance); > > @@ -8694,6 +8783,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) > > !gsi_end_p (gsi); gsi_next (&gsi)) > > { > > gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); > > + /* This can be used to start SLP discovery for early breaks for BB early breaks > > + when we get that far. */ > > if (!assign) > > continue; > > > > @@ -10912,7 +11003,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, > slp_tree node) > > /* Vectorize the instance root. */ > > > > void > > -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) > > +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance > instance) > > { > > gassign *rstmt = NULL; > > > > @@ -11016,6 +11107,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, > slp_instance instance) > > update_stmt (gsi_stmt (rgsi)); > > return; > > } > > + else if (instance->kind == slp_inst_kind_gcond) > > + { > > + /* Only support a single root for now as we can't codegen CFG yet and so we > > + can't support lane > 1 at this time. */ > > + gcc_assert (instance->root_stmts.length () == 1); > > + auto root_stmt_info = instance->root_stmts[0]; > > + auto last_stmt = STMT_VINFO_STMT (root_stmt_info); > > + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); > > + gimple *vec_stmt = NULL; > > + gcc_assert (!SLP_TREE_VEC_DEFS (node).is_empty ()); > > + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, > > + &vec_stmt, node, NULL); > > + gcc_assert (res); > > + return; > > + } > > else > > gcc_unreachable (); > > > > @@ -11234,7 +11340,7 @@ vect_schedule_slp (vec_info *vinfo, const > vec<slp_instance> &slp_instances) > > vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); > > > > if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) > > - vectorize_slp_instance_root_stmt (node, instance); > > + vectorize_slp_instance_root_stmt (vinfo, node, instance); > > > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > index > 43358767934612a9788c587f35f66e007faa8863..bdfc40369a12fb8a15ee9613 > 324c10ad51b31583 100644 > > --- a/gcc/tree-vect-stmts.cc > > +++ b/gcc/tree-vect-stmts.cc > > @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, > loop_vec_info loop_vinfo, > > dump_printf_loc (MSG_NOTE, vect_location, > > "vec_stmt_relevant_p: induction forced for " > > "early break.\n"); > > + LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info); > > *live_p = true; > > > > } > > @@ -13004,7 +13005,7 @@ vectorizable_comparison (vec_info *vinfo, > > /* Check to see if the current early break given in STMT_INFO is valid for > > vectorization. */ > > > > -static bool > > +bool > > vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, > > gimple_stmt_iterator *gsi, gimple **vec_stmt, > > slp_tree slp_node, stmt_vector_for_cost *cost_vec) > > @@ -13028,8 +13029,13 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > slp_tree slp_op0; > > tree op0; > > enum vect_def_type dt0; > > - if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, > > - &vectype)) > > + > > + /* Early break gcond kind SLP trees can be root only and have no children, > > + for instance in the case where the argument is an external. If that's > > + the case there is no operand to analyse use of. */ > > + if ((!slp_node || !SLP_TREE_CHILDREN (slp_node).is_empty ()) > > + && !vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, > &dt0, > > + &vectype)) > > Hmm, but doesn't this somehow show we're mixing things up here? > stmt_info seems to be the gcond but slp_node is for the pattern > condition (or alternatively the invariant condition). > > But yeah, I guess that's how we set this up. Unless you want to > create the pattern also for invariant conditions (why not?). That's a good point, We can do so now that we can lift the invariant statements during pattern matching. Previously it would have affected costing so would have biased the vector loop. I'll send a follow up doing so. Thanks, Tamar > > I think the patch is good to go with the minor adjustment in the first > hunk. > > Thanks, > Richard. > > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > @@ -13037,16 +13043,30 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > return false; > > } > > > > + /* For SLP we don't want to use the type of the operands of the SLP node, > when > > + vectorizing using SLP slp_node will be the children of the gcond and we > > + want to use the type of the direct children which since the gcond is root > > + will be the current node, rather than a child node as vect_is_simple_use > > + assumes. */ > > + if (slp_node) > > + vectype = SLP_TREE_VECTYPE (slp_node); > > + > > if (!vectype) > > return false; > > > > machine_mode mode = TYPE_MODE (vectype); > > - int ncopies; > > + int ncopies, vec_num; > > > > if (slp_node) > > - ncopies = 1; > > + { > > + ncopies = 1; > > + vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > > + } > > else > > - ncopies = vect_get_num_copies (loop_vinfo, vectype); > > + { > > + ncopies = vect_get_num_copies (loop_vinfo, vectype); > > + vec_num = 1; > > + } > > > > vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > > vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > > @@ -13115,9 +13135,11 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > { > > if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype, > > OPTIMIZE_FOR_SPEED)) > > - vect_record_loop_len (loop_vinfo, lens, ncopies, vectype, 1); > > + vect_record_loop_len (loop_vinfo, lens, ncopies * vec_num, > > + vectype, 1); > > else > > - vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL); > > + vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num, > > + vectype, NULL); > > } > > > > return true; > > @@ -13131,9 +13153,18 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); > > > > - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > - vec_stmt, slp_node, cost_vec)) > > - gcc_unreachable (); > > + /* For SLP we don't do codegen of the body starting from the gcond, the > gconds are > > + roots and so by the time we get to them we have already codegened the SLP > tree > > + and so we shouldn't try to do so again. The arguments have already been > > + vectorized. It's not very clean to do this here, But the masking code below is > > + complex and this keeps it all in one place to ease fixes and backports. Once > we > > + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ > > + if (!slp_node) > > + { > > + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, > > + vec_stmt, slp_node, cost_vec)) > > + gcc_unreachable (); > > + } > > > > gimple *stmt = STMT_VINFO_STMT (stmt_info); > > basic_block cond_bb = gimple_bb (stmt); > > @@ -13165,8 +13196,8 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > for (unsigned i = 0; i < stmts.length (); i++) > > { > > tree stmt_mask > > - = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, vectype, > > - i); > > + = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies * vec_num, > > + vectype, i); > > stmt_mask > > = prepare_vec_mask (loop_vinfo, TREE_TYPE (stmt_mask), stmt_mask, > > stmts[i], &cond_gsi); > > @@ -13176,8 +13207,8 @@ vectorizable_early_exit (vec_info *vinfo, > stmt_vec_info stmt_info, > > for (unsigned i = 0; i < stmts.length (); i++) > > { > > tree len_mask = vect_gen_loop_len_mask (loop_vinfo, gsi, &cond_gsi, > > - lens, ncopies, vectype, > > - stmts[i], i, 1); > > + lens, ncopies * vec_num, > > + vectype, stmts[i], i, 1); > > > > workset.quick_push (len_mask); > > } > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > index > 73bccb5a40a8be4d3f5f600239d63c756d2e35e7..5028dd154d1c114c6ca3a4c6 > bf068cffef3aaa6c 100644 > > --- a/gcc/tree-vectorizer.h > > +++ b/gcc/tree-vectorizer.h > > @@ -252,7 +252,8 @@ enum slp_instance_kind { > > slp_inst_kind_reduc_group, > > slp_inst_kind_reduc_chain, > > slp_inst_kind_bb_reduc, > > - slp_inst_kind_ctor > > + slp_inst_kind_ctor, > > + slp_inst_kind_gcond > > }; > > > > /* SLP instance is a sequence of stmts in a loop that can be packed into > > @@ -977,6 +978,10 @@ public: > > /* Statements whose VUSES need updating if early break vectorization is to > > happen. */ > > auto_vec<gimple*> early_break_vuses; > > + > > + /* Record statements that are needed to be live for early break vectorization > > + but may not have an LC PHI node materialized yet in the exits. */ > > + auto_vec<stmt_vec_info> early_break_live_ivs; > > } *loop_vec_info; > > > > /* Access Functions. */ > > @@ -1036,6 +1041,8 @@ public: > > #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores > > #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ > > (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) > > +#define LOOP_VINFO_EARLY_BREAKS_LIVE_IVS(L) \ > > + (L)->early_break_live_ivs > > #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb > > #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses > > #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds > > @@ -2524,6 +2531,9 @@ extern bool vectorizable_phi (vec_info *, > stmt_vec_info, gimple **, slp_tree, > > stmt_vector_for_cost *); > > extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, > > gimple **, slp_tree, stmt_vector_for_cost *); > > +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, > > + gimple_stmt_iterator *, gimple **, > > + slp_tree, stmt_vector_for_cost *); > > extern bool vect_emulated_vector_p (tree); > > extern bool vect_can_vectorize_without_simd_p (tree_code); > > extern bool vect_can_vectorize_without_simd_p (code_helper); > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c new file mode 100644 index 0000000000000000000000000000000000000000..4bfc9880f9fc869bf616123ff509d13be17ffacf --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_126.c @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ + +#define N 1024 +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; + if (vect_a[i] > x) + { + ret *= vect_a[i]; + return vect_a[i]; + } + vect_a[i] = x; + ret += vect_a[i] + vect_b[i]; + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c new file mode 100644 index 0000000000000000000000000000000000000000..67cb5d34a77192e5d7d72c35df8e83535ef184ab --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_127.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; + if (vect_a[i]*2 != x) + break; + vect_a[i] = x; + + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c new file mode 100644 index 0000000000000000000000000000000000000000..6d7fb920ec2de529a4aa1de2c4a04286989204fd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_128.c @@ -0,0 +1,31 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i+=2) + { + vect_b[i] = x + i; + vect_b[i+1] = x + i+1; + if (vect_a[i]*2 != x) + break; + if (vect_a[i+1]*2 != x) + break; + vect_a[i] = x; + vect_a[i+1] = x; + + } + return ret; +} diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 600987dd6e5d506aa5fbb02350f9dab77793d382..7e765df466a59249feb999c24d8f2dad232948ae 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -3697,6 +3697,13 @@ vect_build_slp_instance (vec_info *vinfo, "Analyzing vectorizable constructor: %G\n", root_stmt_infos[0]->stmt); } + else if (kind == slp_inst_kind_gcond) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Analyzing vectorizable control flow: %G", + root_stmt_infos[0]->stmt); + } if (dump_enabled_p ()) { @@ -4143,6 +4150,12 @@ vect_analyze_slp_instance (vec_info *vinfo, STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ())); } + else if (kind == slp_inst_kind_gcond) + { + /* Collect the stores and store them in scalar_stmts. */ + scalar_stmts.create (1); + scalar_stmts.quick_push (vect_stmt_to_vectorize (next_info)); + } else gcc_unreachable (); @@ -4742,6 +4755,56 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size, bst_map, NULL, force_single_lane); } } + + /* Find SLP sequences starting from gconds. */ + for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) + { + auto cond_info = loop_vinfo->lookup_stmt (cond); + vec<stmt_vec_info> stmts; + vec<stmt_vec_info> roots = vNULL; + vec<tree> remain = vNULL; + + cond_info = vect_stmt_to_vectorize (cond_info); + roots.safe_push (cond_info); + stmts.create (2); + tree args0 = gimple_cond_lhs (STMT_VINFO_STMT (cond_info)); + tree args1 = gimple_cond_rhs (STMT_VINFO_STMT (cond_info)); + /* An argument without a loop def will be codegened from vectorizing the + root gcond itself. As such we don't need to try to build an SLP tree + from them. It's highly likely that the resulting SLP tree here if both + arguments have a def will be incompatible, but we rely on it being split + later on. */ + if (auto varg = loop_vinfo->lookup_def (args0)) + stmts.quick_push (vect_stmt_to_vectorize (varg)); + + if (auto varg = loop_vinfo->lookup_def (args1)) + stmts.quick_push (vect_stmt_to_vectorize (varg)); + + if (!stmts.is_empty ()) + vect_build_slp_instance (vinfo, slp_inst_kind_gcond, + stmts, roots, remain, + max_tree_size, &limit, + bst_map, NULL, force_single_lane); + } + + /* Find and create slp instances for inductions that have been forced + live due to early break. */ + edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo)); + for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo)) + { + vec<stmt_vec_info> stmts; + vec<stmt_vec_info> roots = vNULL; + vec<tree> remain = vNULL; + gphi *lc_phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info)); + tree def = gimple_phi_arg_def_from_edge (lc_phi, latch_e); + stmt_vec_info lc_info = loop_vinfo->lookup_def (def); + stmts.create (1); + stmts.quick_push (vect_stmt_to_vectorize (lc_info)); + vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group, + stmts, roots, remain, + max_tree_size, &limit, + bst_map, NULL, force_single_lane); + } } hash_set<slp_tree> visited_patterns; @@ -7157,8 +7220,9 @@ maybe_push_to_hybrid_worklist (vec_info *vinfo, } } } - /* No def means this is a loo_vect sink. */ - if (!any_def) + /* No def means this is a loop_vect sink. Gimple conditionals also don't have a + def but shouldn't be considered sinks. */ + if (!any_def && STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def) { if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, @@ -7542,9 +7606,27 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, return true; visited_vec.safe_push (node); + /* If early break also check the root statement as we need to both analyze + and trigger codegen for it. The analysis will check whether can actually + vectorize it. At the memoment splitting off the analsysi bit from inside + it duplicates a lot of the setup code so it's not worth while to do so. + However when either the non-SLP loop vect goes away or we split vectorizable_* + functions then we can call the analysis only part from here instead. */ bool res = true; - unsigned visited_rec_start = visited_vec.length (); unsigned cost_vec_rec_start = cost_vec->length (); + if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_gcond) + { + auto root_stmt_info = SLP_INSTANCE_ROOT_STMTS (node_instance)[0]; + res = vectorizable_early_exit (vinfo, root_stmt_info, NULL, NULL, NULL, + cost_vec); + if (!res) + { + cost_vec->truncate (cost_vec_rec_start); + return res; + } + } + + unsigned visited_rec_start = visited_vec.length (); bool seen_non_constant_child = false; FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) { @@ -8612,6 +8694,8 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo) !gsi_end_p (gsi); gsi_next (&gsi)) { gassign *assign = dyn_cast<gassign *> (gsi_stmt (gsi)); + /* This can be used to start SLP discovery for early breaks for BB early breaks + when we get that far. */ if (!assign) continue; @@ -10758,7 +10842,7 @@ vect_remove_slp_scalar_calls (vec_info *vinfo, slp_tree node) /* Vectorize the instance root. */ void -vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) +vectorize_slp_instance_root_stmt (vec_info *vinfo, slp_tree node, slp_instance instance) { gassign *rstmt = NULL; @@ -10862,6 +10946,21 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance) update_stmt (gsi_stmt (rgsi)); return; } + else if (instance->kind == slp_inst_kind_gcond) + { + /* Only support a single root for now as we can't codegen CFG yet and so we + can't support lane > 1 at this time. */ + gcc_assert (instance->root_stmts.length () == 1); + auto root_stmt_info = instance->root_stmts[0]; + auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt; + gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt); + gimple *vec_stmt = NULL; + gcc_assert (SLP_TREE_NUMBER_OF_VEC_STMTS (node) != 0); + bool res = vectorizable_early_exit (vinfo, root_stmt_info, &rgsi, + &vec_stmt, node, NULL); + gcc_assert (res); + return; + } else gcc_unreachable (); @@ -11080,7 +11179,7 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances) vect_schedule_scc (vinfo, node, instance, scc_info, maxdfs, stack); if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ()) - vectorize_slp_instance_root_stmt (node, instance); + vectorize_slp_instance_root_stmt (vinfo, node, instance); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index b72b54d666879d8485f8d972b4e8d9dc64bc86b3..8f3f35989879199ffd0eb24729cb7ade856a3c4d 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -411,6 +411,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo, dump_printf_loc (MSG_NOTE, vect_location, "vec_stmt_relevant_p: induction forced for " "early break.\n"); + LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS (loop_vinfo).safe_push (stmt_info); *live_p = true; } @@ -12933,7 +12934,7 @@ vectorizable_comparison (vec_info *vinfo, /* Check to see if the current early break given in STMT_INFO is valid for vectorization. */ -static bool +bool vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, gimple_stmt_iterator *gsi, gimple **vec_stmt, slp_tree slp_node, stmt_vector_for_cost *cost_vec) @@ -12958,7 +12959,7 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, tree op0; enum vect_def_type dt0; if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0, - &vectype)) + &vectype)) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -12966,6 +12967,13 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, return false; } + /* For SLP we don't want to use the type of the operands of the SLP node, when + vectorizing using SLP slp_node will be the children of the gcond and we want to + use the type of the direct children which since the gcond is root will be the + current node, rather than a child node as vect_is_simple_use assumes. */ + if (slp_node) + vectype = SLP_TREE_VECTYPE (slp_node); + if (!vectype) return false; @@ -13060,9 +13068,18 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n"); - if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, - vec_stmt, slp_node, cost_vec)) - gcc_unreachable (); + /* For SLP we don't do codegen of the body starting from the gcond, the gconds are + roots and so by the time we get to them we have already codegened the SLP tree + and so we shouldn't try to do so again. The arguments have already been + vectorized. It's not very clean to do this here, But the masking code below is + complex and this keeps it all in one place to ease fixes and backports. Once we + drop the non-SLP loop vect or split vectorizable_* this can be simplified. */ + if (!slp_node) + { + if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi, + vec_stmt, slp_node, cost_vec)) + gcc_unreachable (); + } gimple *stmt = STMT_VINFO_STMT (stmt_info); basic_block cond_bb = gimple_bb (stmt); diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 490061aea2f6d465d9589eb97bbd34a920d76b1c..53483303c4ac3482760fe722354f602e0243e5a2 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -252,7 +252,8 @@ enum slp_instance_kind { slp_inst_kind_reduc_group, slp_inst_kind_reduc_chain, slp_inst_kind_bb_reduc, - slp_inst_kind_ctor + slp_inst_kind_ctor, + slp_inst_kind_gcond }; /* SLP instance is a sequence of stmts in a loop that can be packed into @@ -977,6 +978,10 @@ public: /* Statements whose VUSES need updating if early break vectorization is to happen. */ auto_vec<gimple*> early_break_vuses; + + /* Record statements that are needed to be live for early break vectorization + but may not have an LC PHI node materialized yet in the exits. */ + auto_vec<stmt_vec_info> early_break_live_stmts; } *loop_vec_info; /* Access Functions. */ @@ -1036,6 +1041,8 @@ public: #define LOOP_VINFO_EARLY_BRK_STORES(L) (L)->early_break_stores #define LOOP_VINFO_EARLY_BREAKS_VECT_PEELED(L) \ (single_pred ((L)->loop->latch) != (L)->vec_loop_iv_exit->src) +#define LOOP_VINFO_EARLY_BREAKS_LIVE_STMTS(L) \ + (L)->early_break_live_stmts #define LOOP_VINFO_EARLY_BRK_DEST_BB(L) (L)->early_break_dest_bb #define LOOP_VINFO_EARLY_BRK_VUSES(L) (L)->early_break_vuses #define LOOP_VINFO_LOOP_CONDS(L) (L)->conds @@ -2522,6 +2529,9 @@ extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree, stmt_vector_for_cost *); extern bool vectorizable_recurr (loop_vec_info, stmt_vec_info, gimple **, slp_tree, stmt_vector_for_cost *); +extern bool vectorizable_early_exit (vec_info *, stmt_vec_info, + gimple_stmt_iterator *, gimple **, + slp_tree, stmt_vector_for_cost *); extern bool vect_emulated_vector_p (tree); extern bool vect_can_vectorize_without_simd_p (tree_code); extern bool vect_can_vectorize_without_simd_p (code_helper);