Message ID: 385547e6-abbd-3633-ad69-d4fb6e604c97@arm.com
State: New
Series: [1/2,vect] PR 88915: Vectorize epilogues when versioning loops
On Fri, 23 Aug 2019, Andre Vieira (lists) wrote: > Hi, > > This patch is an improvement on my last RFC. As you pointed out, we can do > the vectorization analysis of the epilogues before doing the transformation, > using the same approach as used by openmp simd. I have not yet incorporated > the cost tweaks for vectorizing the epilogue, I would like to do this in a > subsequent patch, to make it easier to test the differences. > > I currently disable the vectorization of epilogues when versioning for > iterations. This is simply because I do not completely understand how the > assumptions are created and couldn't determine whether using skip_vectors with > this would work. If you don't think it is a problem or have a testcase to > show it work I would gladly look at it. I don't think there's any problem. Basically the versioning condition is if (can_we_compute_niter), most of the time it is an extra condition from niter analysis under which niter is for example zero. This should also be the same with all vector sizes. - delete loop_vinfo; + { + /* Set versioning threshold of the original LOOP_VINFO based + on the last vectorization of the epilog. */ + LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) + = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); + delete loop_vinfo; + } I'm not sure this works reliably since the order we process vector sizes is under target control and not necessarily decreasing. I think you want to keep track of the minimum here? Preferably separately I guess. From what I see vect_analyze_loop_2 doesn't need vect_epilogues_nomask and thus it doesn't change throughout the iteration. else - delete loop_vinfo; + { + /* Disable epilog vectorization if we can't determine the epilogs can + be vectorized. */ + *vect_epilogues_nomask &= vectorized_loops > 1; + delete loop_vinfo; + } and this is a bit premature and instead it should be done just before returning success? Maybe also storing the epilogue vector sizes we can handle in the loop_vinfo, thereby representing !vect_epilogues_nomask if there are no such sizes which also means that @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, /* Epilogue of vectorized loop must be vectorized too. */ if (new_loop) - ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, - new_loop, loop_vinfo, NULL, NULL); + { + /* Don't include vectorized epilogues in the "vectorized loops" count. + */ + unsigned dont_count = *num_vectorized_loops; + ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count, + new_loop, loop_vinfo, NULL, NULL); + } can be optimized to not re-check all smaller sizes (but even assert re-analysis succeeds to the original result for the actual transform). Otherwise this looks reasonable to me. Thanks, Richard. > > Bootstrapped this and the next patch on x86_64 and aarch64-unknown-linux-gnu, > with no regressions (after test changes in next patch). > > gcc/ChangeLog: > 2019-08-23 Andre Vieira <andre.simoesdiasvieira@arm.com> > > PR 88915 > * gentype.c (main): Add poly_uint64 type to generator. > * tree-vect-loop.c (vect_analyze_loop_2): Make it determine > whether we vectorize epilogue loops. > (vect_analyze_loop): Idem. > (vect_transform_loop): Pass decision to vectorize epilogues > to vect_do_peeling. > * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors > when doing loop versioning if we decided to vectorize epilogues. > (vect-loop_versioning): Moved decision to check_profitability > based on cost model. 
> * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling, > vect_analyze_loop, vect_transform_loop): Update declarations. > * tree-vectorizer.c: Include params.h > (try_vectorize_loop_1): Initialize vect_epilogues_nomask > to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop > and vect_transform_loop. Also make sure vectorizing epilogues > does not count towards number of vectorized loops. > >
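As a minimal standalone illustration of the ordering point above (names invented; this is not the GCC API), the versioning threshold recorded on the main loop's info wants to be the running minimum over all successfully analysed epilogue candidates, not whichever candidate the target happened to report last:

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

int main ()
{
  /* Hypothetical per-candidate thresholds, in target-controlled order.  */
  std::vector<uint64_t> candidate_thresholds = { 32, 8, 16 };

  uint64_t lowest = UINT64_MAX;
  for (uint64_t th : candidate_thresholds)
    lowest = std::min (lowest, th);   /* keep the running minimum */

  assert (lowest == 8);               /* not 16, the last value processed */
  return 0;
}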
On 8/23/19 10:50 AM, Andre Vieira (lists) wrote: > Hi, > > This patch is an improvement on my last RFC. As you pointed out, we can > do the vectorization analysis of the epilogues before doing the > transformation, using the same approach as used by openmp simd. I have > not yet incorporated the cost tweaks for vectorizing the epilogue, I > would like to do this in a subsequent patch, to make it easier to test > the differences. > > I currently disable the vectorization of epilogues when versioning for > iterations. This is simply because I do not completely understand how > the assumptions are created and couldn't determine whether using > skip_vectors with this would work. If you don't think it is a problem > or have a testcase to show it work I would gladly look at it. > > Bootstrapped this and the next patch on x86_64 and > aarch64-unknown-linux-gnu, with no regressions (after test changes in > next patch). > > gcc/ChangeLog: > 2019-08-23 Andre Vieira <andre.simoesdiasvieira@arm.com> > > PR 88915 > * gentype.c (main): Add poly_uint64 type to generator. > * tree-vect-loop.c (vect_analyze_loop_2): Make it determine > whether we vectorize epilogue loops. > (vect_analyze_loop): Idem. > (vect_transform_loop): Pass decision to vectorize epilogues > to vect_do_peeling. > * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors > when doing loop versioning if we decided to vectorize epilogues. > (vect-loop_versioning): Moved decision to check_profitability > based on cost model. > * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling, > vect_analyze_loop, vect_transform_loop): Update declarations. > * tree-vectorizer.c: Include params.h > (try_vectorize_loop_1): Initialize vect_epilogues_nomask > to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop > and vect_transform_loop. Also make sure vectorizing epilogues > does not count towards number of vectorized loops. Nit. In several places you use "epilog", proper spelling is "epilogue". > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c > index 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c 100644 > --- a/gcc/tree-vectorizer.c > +++ b/gcc/tree-vectorizer.c > @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, > > /* Epilogue of vectorized loop must be vectorized too. */ > if (new_loop) > - ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, > - new_loop, loop_vinfo, NULL, NULL); > + { > + /* Don't include vectorized epilogues in the "vectorized loops" count. > + */ > + unsigned dont_count = *num_vectorized_loops; > + ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count, > + new_loop, loop_vinfo, NULL, NULL); > + } Nit. Don't wrap a comment with just the closing */ on its own line. Instead wrap before "count" so that. This is fine for the trunk after fixing those nits. jeff
Hi Richard,

As I mentioned in the IRC channel, this is my current work-in-progress patch. It currently ICEs when vectorizing gcc/testsuite/gcc.c-torture/execute/nestfunc-2.c with '-O3' and '--param vect-epilogues-nomask=1'. It ICEs because the epilogue loop (after if-conversion) and the main loop (before vectorization) are not the same: there are a bunch of extra BBs, and the epilogue loop seems to need some cleaning up too.

Let me know if you see a way around this issue.

Cheers,
Andre
Hi Richard,

As I mentioned in the IRC channel, I managed to get "most" of the regression testsuite working for x86_64 (avx512) and aarch64.

On x86_64 I get a failure that I can't explain, was hoping you might be able to have a look with me:
"PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"

vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first iteration of the main loop. The tree dumps look alright, but I do notice the stack usage seems to change between --param vect-epilogue-nomask={0,1}. Am I missing to update some field that may later lead to the amount of stack being used? I am confused, it could very well be that I am missing something obvious, I am not too familiar with x86's ISA. I will try to investigate further.

This patch needs further clean-up and more comments (or comment updates), but I thought I'd share current state to see if you can help me unblock.

Cheers,
Andre
On Tue, 8 Oct 2019, Andre Vieira (lists) wrote: > Hi Richard, > > As I mentioned in the IRC channel, I managed to get "most" of the regression > testsuite working for x86_64 (avx512) and aarch64. > > On x86_64 I get a failure that I can't explain, was hoping you might be able > to have a look with me: > "PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test" > > vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first > iteration of the main loop. The tree dumps look alright, but I do notice the > stack usage seems to change between --param vect-epilogue-nomask={0,1}. So the issue is that we have => 0x0000000000400778 <+72>: vmovdqa64 %zmm1,-0x40(%rax) but the memory accessed is not appropriately aligned. The vectorizer sets DECL_USER_ALIGN on the stack local but somehow later it downs it to 256: Old value = 640 New value = 576 ensure_base_align (dr_info=0x526f788) at /tmp/trunk/gcc/tree-vect-stmts.c:6294 6294 DECL_USER_ALIGN (base_decl) = 1; (gdb) l 6289 if (decl_in_symtab_p (base_decl)) 6290 symtab_node::get (base_decl)->increase_alignment (align_base_to); 6291 else 6292 { 6293 SET_DECL_ALIGN (base_decl, align_base_to); 6294 DECL_USER_ALIGN (base_decl) = 1; 6295 } this means vectorizing the epilogue modifies the DRs, in particular the base alignment? > Am I missing to update some field that may later lead to the amount of stack > being used? I am confused, it could very well be that I am missing something > obvious, I am not too familiar with x86's ISA. I will try to investigate > further. > > This patch needs further clean-up and more comments (or comment updates), but > I thought I'd share current state to see if you can help me unblock. > > Cheers, > Andre >
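To make the failure mode above concrete, here is a small self-contained sketch (not the GCC code; names are invented) of the safe update rule the later fix moves towards: several consumers request an alignment for the same stack object, and a later, weaker request (the epilogue analysed with a smaller vector size) must never lower an alignment already granted to a stronger one, such as the main loop's 64-byte-aligned vmovdqa64 store:

#include <algorithm>
#include <cstdio>

struct object_info { unsigned align_bytes; };

/* Only ever raise the recorded alignment; a weaker later request must not
   undo a stronger earlier one.  */
static void request_alignment (object_info &obj, unsigned wanted)
{
  obj.align_bytes = std::max (obj.align_bytes, wanted);
}

int main ()
{
  object_info stack_slot = { 16 };
  request_alignment (stack_slot, 64);  /* main loop, 512-bit aligned accesses */
  request_alignment (stack_slot, 32);  /* epilogue, 256-bit accesses */
  printf ("final alignment: %u bytes\n", stack_slot.align_bytes);  /* 64, not 32 */
  return 0;
}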
Hi, After all the discussions and respins I now believe this patch is close to what we envisioned. This patch achieves two things when vect-epilogues-nomask=1: 1) It analyzes the original loop for each supported vector size and saves this analysis per loop, as well as the vector sizes we know we can vectorize the loop for. 2) When loop versioning it uses the 'skip_vector' code path to vectorize the epilogue, and uses the lowest versioning threshold between the main and epilogue's. As side effects of this patch I also changed ensure_base_align to only update the alignment if the new alignment is lower than the current one. This function already did that if the object was a symbol, now it behaves this way for any object. I bootstrapped this patch with both vect-epilogues-nomask turned on and off on x86_64 (AVX512) and aarch64. Regression tests looked good. Is this OK for trunk? gcc/ChangeLog: 2019-10-10 Andre Vieira <andre.simoesdiasvieira@arm.com> PR 88915 * cfgloop.h (loop): Add epilogue_vsizes member. * cfgloop.c (flow_loop_free): Release epilogue_vsizes. (alloc_loop): Initialize epilogue_vsizes. * gentype.c (main): Add poly_uint64 type and vector_sizes to generator. * tree-vect-loop.c (vect_get_loop_niters): Make externally visible. (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (determine_peel_for_niter): New. Outlined code to re-use in two places. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (replace_ops): New helper function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and update necessary information. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. (vect_loop_versioning): Moved decision to check_profitability based on cost model. * tree-vect-stmts.c (ensure_base_align): Only update alignment if new alignment is lower. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member. (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already create loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Cheers, Andre
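For readers less familiar with the scheme, a rough self-contained sketch of the control structure described above; the vector factors, the function and the threshold value are invented for illustration, and the real transformation of course works on GIMPLE rather than on source code:

#include <cstdio>

const long VERSIONING_THRESHOLD = 20;   /* hypothetical: min of main/epilogue thresholds */

static void saxpy (float *y, const float *x, float a, long n)
{
  long i = 0;
  if (n >= VERSIONING_THRESHOLD)        /* versioning / skip-vector check */
    {
      for (; i + 16 <= n; i += 16)      /* main vector loop, VF = 16 */
        for (long j = 0; j < 16; ++j)   /* stands in for one 16-lane vector statement */
          y[i + j] += a * x[i + j];
      for (; i + 4 <= n; i += 4)        /* vectorized epilogue, VF = 4 */
        for (long j = 0; j < 4; ++j)    /* stands in for one 4-lane vector statement */
          y[i + j] += a * x[i + j];
    }
  for (; i < n; ++i)                    /* scalar remainder / scalar version */
    y[i] += a * x[i];
}

int main ()
{
  float x[23], y[23];
  for (int k = 0; k < 23; ++k) { x[k] = k; y[k] = 1.0f; }
  saxpy (y, x, 2.0f, 23);
  printf ("%f %f\n", y[0], y[22]);      /* 1.0 and 45.0 */
  return 0;
}

The reason for using the lowest of the main and epilogue versioning thresholds is that trip counts too small for the wide main loop but still large enough for the narrower epilogue are not needlessly sent down the scalar version.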
On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > Hi, > > After all the discussions and respins I now believe this patch is close to > what we envisioned. > > This patch achieves two things when vect-epilogues-nomask=1: > 1) It analyzes the original loop for each supported vector size and saves this > analysis per loop, as well as the vector sizes we know we can vectorize the > loop for. > 2) When loop versioning it uses the 'skip_vector' code path to vectorize the > epilogue, and uses the lowest versioning threshold between the main and > epilogue's. > > As side effects of this patch I also changed ensure_base_align to only update > the alignment if the new alignment is lower than the current one. This > function already did that if the object was a symbol, now it behaves this way > for any object. > > I bootstrapped this patch with both vect-epilogues-nomask turned on and off on > x86_64 (AVX512) and aarch64. Regression tests looked good. > > Is this OK for trunk? + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; please don't enlarge struct loop, instead track this somewhere in the vectorizer (in loop_vinfo? I see you already have epilogue_vinfos there - so the loop_vinfo simply lacks convenient access to the vector_size?) I don't see any use that could be trivially adjusted to look at a loop_vinfo member instead. For the vect_update_inits_of_drs this means that we'd possibly do less CSE. Not sure if really an issue. You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to LOOP_VINFO_EPILOGUE_P. @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, else niters_prolog = build_int_cst (type, 0); + loop_vec_info epilogue_vinfo = NULL; + if (vect_epilogues) + { ... + vect_epilogues = false; + } + I don't understand what all this does - it clearly needs a comment. Maybe the overall comment of the function should be amended with an overview of how we handle [multiple] epilogue loop vectorization? + + if (epilogue_any_upper_bound && prolog_peeling >= 0) + { + epilog->any_upper_bound = true; + epilog->nb_iterations_upper_bound = eiters + 1; + } + comment missing. How can prolog_peeling be < 0? We likely didn't set the upper bound because we don't know it in the case we skipped the vector loop (skip_vector)? So make sure to not introduce wrong-code issues here - maybe do this optimization as followup? 
class loop * -vect_loop_versioning (loop_vec_info loop_vinfo, - unsigned int th, bool check_profitability, - poly_uint64 versioning_threshold) +vect_loop_versioning (loop_vec_info loop_vinfo) { class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop; class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); @@ -2988,10 +3151,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo, bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo); bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo); bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo); + poly_uint64 versioning_threshold + = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); tree version_simd_if_cond = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo); + unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); - if (check_profitability) + if (th >= vect_vf_for_cost (loop_vinfo) + && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && !ordered_p (th, versioning_threshold)) cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters, build_int_cst (TREE_TYPE (scalar_loop_iters), th - 1)); split out this refactoring - preapproved. @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo) return 0; } - HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop); + HOST_WIDE_INT estimated_niter = -1; + + if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) + estimated_niter + = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1; + if (estimated_niter == -1) + estimated_niter = estimated_stmt_executions_int (loop); if (estimated_niter == -1) estimated_niter = likely_max_stmt_executions_int (loop); if (estimated_niter != -1 it's clearer if the old code is completely in a else {} path even though vect_vf_for_cost - 1 should never be -1. +/* Decides whether we need to create an epilogue loop to handle + remaining scalar iterations and sets PEELING_FOR_NITERS accordingly. */ + +void +determine_peel_for_niter (loop_vec_info loop_vinfo) +{ + extra vertical space + unsigned HOST_WIDE_INT const_vf; + HOST_WIDE_INT max_niter if it's a 1:1 copy outlined then split it out - preapproved (so further reviews get smaller patches ;)) I'd add a LOOP_VINFO_PEELING_FOR_NITER () = false as final else since that's what we do by default? - if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) + if (LOOP_REQUIRES_VERSIONING (loop_vinfo) + || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) + && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))) not sure why we need to do this for epilouges? + + /* Use the same condition as vect_transform_loop to decide when to use + the cost to determine a versioning threshold. */ + if (th >= vect_vf_for_cost (loop_vinfo) + && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && ordered_p (th, niters_th)) + niters_th = ordered_max (poly_uint64 (th), niters_th); that's an independent change, right? Please split out, it's pre-approved if it tests OK separately. +static tree +replace_ops (tree op, hash_map<tree, tree> &mapping) +{ I'm quite sure I've seen such beast elsewhere ;) simplify_replace_tree comes up first (not a 1:1 match but hints at a possible tree sharing issue in your variant). 
@@ -8497,11 +8588,11 @@ vect_transform_loop (loop_vec_info loop_vinfo) if (th >= vect_vf_for_cost (loop_vinfo) && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, - "Profitability threshold is %d loop iterations.\n", - th); - check_profitability = true; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Profitability threshold is %d loop iterations.\n", + th); + check_profitability = true; } /* Make sure there exists a single-predecessor exit bb. Do this before obvious (separate) + tree advance; epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, &step_vector, &niters_vector_mult_vf, th, - check_profitability, niters_no_overflow); + check_profitability, niters_no_overflow, + &advance); + + if (epilogue) + { + basic_block *orig_bbs = get_loop_body (loop); + loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue); ... please record this in vect_do_peeling itself and store the orig_stmts/drs/etc. in the epilogue loop_vinfo and ... + /* We are done vectorizing the main loop, so now we update the epilogues + stmt_vec_info's. At the same time we set the gimple UID of each + statement in the epilogue, as these are used to look them up in the + epilogues loop_vec_info later. We also keep track of what ... split this out to a new function. I wonder why you need to record the DRs, are they not available via ->datarefs and lookup_dr ()? diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index 601a6f55fbff388c89f88d994e790aebf2bf960e..201549da6c0cbae0797a23ae1b8967b9895505e9 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -6288,7 +6288,7 @@ ensure_base_align (dr_vec_info *dr_info) if (decl_in_symtab_p (base_decl)) symtab_node::get (base_decl)->increase_alignment (align_base_to); - else + else if (DECL_ALIGN (base_decl) < align_base_to) { SET_DECL_ALIGN (base_decl, align_base_to); DECL_USER_ALIGN (base_decl) = 1; split out - preapproved. Still have to go over the main loop doing the analysis/transform. Thanks, it looks really promising (albeit exepectedly ugly due to the data rewriting). Richard. > gcc/ChangeLog: > 2019-10-10 Andre Vieira <andre.simoesdiasvieira@arm.com> > > PR 88915 > * cfgloop.h (loop): Add epilogue_vsizes member. > * cfgloop.c (flow_loop_free): Release epilogue_vsizes. > (alloc_loop): Initialize epilogue_vsizes. > * gentype.c (main): Add poly_uint64 type and vector_sizes to > generator. > * tree-vect-loop.c (vect_get_loop_niters): Make externally visible. > (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (determine_peel_for_niter): New. Outlined code to re-use in two > places. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (replace_ops): New helper function. > (vect_transform_loop): When vectorizing epilogues re-use analysis done > on main loop and update necessary information. > * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > stmts on loop preheader edge. > (vect_do_peeling): Enable skip-vectors when doing loop versioning if > we decided to vectorize epilogues. Update epilogues NITERS and > construct ADVANCE to update epilogues data references where needed. 
> (vect_loop_versioning): Moved decision to check_profitability > based on cost model. > * tree-vect-stmts.c (ensure_base_align): Only update alignment > if new alignment is lower. > * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member. > (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters, > vect_update_inits_of_drs, determine_peel_for_niter, > vect_analyze_loop): Add or update declarations. > * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > create loop_vec_info's for epilogues when available. Otherwise analyse > epilogue separately. > > > > Cheers, > Andre >
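On the ADVANCE mentioned in the ChangeLog above, a small sketch of the underlying arithmetic (all names and numbers invented; in the patch the advance is an SSA name computed at run time, since with versioning and the skip-vector path the main loop's trip count need not be a compile-time constant):

#include <cstdio>

int main ()
{
  long niters = 23, main_vf = 8;
  /* Scalar iterations consumed by the main vector loop.  */
  long main_iters = (niters / main_vf) * main_vf;
  /* Each epilogue data reference starts past the elements already
     processed, i.e. its init is advanced by main_iters times its step.  */
  long elem_step = 1;                  /* unit-stride access, in elements */
  long advance = main_iters * elem_step;

  printf ("epilogue data refs start at element %ld, %ld iterations remain\n",
          advance, niters - main_iters);   /* 16 and 7 */
  return 0;
}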
Hi Richi, See inline responses to your comments. On 11/10/2019 13:57, Richard Biener wrote: > On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > >> Hi, >> > > + > + /* Keep track of vector sizes we know we can vectorize the epilogue > with. */ > + vector_sizes epilogue_vsizes; > }; > > please don't enlarge struct loop, instead track this somewhere > in the vectorizer (in loop_vinfo? I see you already have > epilogue_vinfos there - so the loop_vinfo simply lacks > convenient access to the vector_size?) I don't see any > use that could be trivially adjusted to look at a loop_vinfo > member instead. Done. > > For the vect_update_inits_of_drs this means that we'd possibly > do less CSE. Not sure if really an issue. CSE of what exactly? You are afraid we are repeating a calculation here we have done elsewhere before? > > You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes > LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to > LOOP_VINFO_EPILOGUE_P. I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I then use the resulting loop info. If there are cases you feel strongly about let me know. > > @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > niters, tree nitersm1, > else > niters_prolog = build_int_cst (type, 0); > > + loop_vec_info epilogue_vinfo = NULL; > + if (vect_epilogues) > + { > ... > + vect_epilogues = false; > + } > + > > I don't understand what all this does - it clearly needs a comment. > Maybe the overall comment of the function should be amended with > an overview of how we handle [multiple] epilogue loop vectorization? I added more comments both here and on top of the function. Hopefully it is a bit clearer now, but it might need some tweaking. > > + > + if (epilogue_any_upper_bound && prolog_peeling >= 0) > + { > + epilog->any_upper_bound = true; > + epilog->nb_iterations_upper_bound = eiters + 1; > + } > + > > comment missing. How can prolog_peeling be < 0? We likely > didn't set the upper bound because we don't know it in the > case we skipped the vector loop (skip_vector)? So make sure > to not introduce wrong-code issues here - maybe do this > optimization as followup?n > So the reason for this code wasn't so much an optimization as it was for correctness. But I was mistaken, the failure I was seeing without this code was not because of this code, but rather being hidden by it. The problem I was seeing was that a prolog was being created using the original loop copy, rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in the scalar prolog, leading to expand ICEs. I have fixed that issue by making sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or SCALAR loop for epilogue peeling depending on whether we will be vectorizing said epilogue. > @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info > loop_vinfo) > return 0; > } > > - HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop); > + HOST_WIDE_INT estimated_niter = -1; > + > + if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) > + estimated_niter > + = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1; > + if (estimated_niter == -1) > + estimated_niter = estimated_stmt_executions_int (loop); > if (estimated_niter == -1) > estimated_niter = likely_max_stmt_executions_int (loop); > if (estimated_niter != -1 > > it's clearer if the old code is completely in a else {} path > even though vect_vf_for_cost - 1 should never be -1. > Done for the == -1 cases, need to keep the != -1 outside of course. 
> - if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) > + if (LOOP_REQUIRES_VERSIONING (loop_vinfo) > + || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) > + && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))) > > not sure why we need to do this for epilouges? > This is because we want to compute the versioning threshold for epilogues such that we can use the minimum versioning threshold when versioning the main loop. The reason we need to ask we need to ask the original main loop is partially because of code in 'vect_analyze_data_ref_dependences' that chooses to not do DR dependence analysis and thus never fills LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence LOOP_VINFO_COMP_ALIAS_DDRS is always 0. The piece of code is preceded by this comment: /* For epilogues we either have no aliases or alias versioning was applied to original loop. Therefore we may just get max_vf using VF of original loop. */ I have added some comments to make it clearer. > > +static tree > +replace_ops (tree op, hash_map<tree, tree> &mapping) > +{ > > I'm quite sure I've seen such beast elsewhere ;) simplify_replace_tree > comes up first (not a 1:1 match but hints at a possible tree > sharing issue in your variant). > The reason I couldn't use simplify_replace_tree is because I didn't know what the "OLD" value is at the time I want to call it. Basically I want to check whether an SSA name is a key in MAPPING and if so replace it with the corresponding VALUE. I have changed simplify_replace_tree such that valueize can take a context parameter. I replaced one use of replace_ops with it and the other I specialized as I found that it was always a MEM_REF and we needed to replace the address it was dereferencing. > > + tree advance; > epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, > &niters_vector, > &step_vector, &niters_vector_mult_vf, th, > - check_profitability, niters_no_overflow); > + check_profitability, niters_no_overflow, > + &advance); > + > + if (epilogue) > + { > + basic_block *orig_bbs = get_loop_body (loop); > + loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue); > ... > > orig_stmts/drs/etc. in the epilogue loop_vinfo and ... > > + /* We are done vectorizing the main loop, so now we update the > epilogues > + stmt_vec_info's. At the same time we set the gimple UID of each > + statement in the epilogue, as these are used to look them up in > the > + epilogues loop_vec_info later. We also keep track of what > ... > > split this out to a new function. I wonder why you need to record > the DRs, are they not available via ->datarefs and lookup_dr ()? lookup_dr may no longer work at this point. I found that for some memory accesses by the time I got to this point, the DR_STMT of the data_reference pointed to a scalar statement that no longer existed and the lookup_dr to that data reference ICE's. I can't make this update before we transform the loop because the data references are shared, so I decided to capture the dr_vec_info's instead. Apparently we don't ever do a lookup_dr past this point, which I must admit is surprising. > Still have to go over the main loop doing the analysis/transform. > > Thanks, it looks really promising (albeit exepectedly ugly due to > the data rewriting). > Yeah, though I feel like now that I have put it away into functions it makes it look cleaner. That vect_transform_loop function was getting too big! Is this OK for trunk? 
gcc/ChangeLog:
2019-10-22 Andre Vieira <andre.simoesdiasvieira@arm.com>

        PR 88915
        * gentype.c (main): Add poly_uint64 type and vector_sizes to
        generator.
        * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
        * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
        and make the valueize function pointer also take a void pointer.
        * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
        around vn_valueize, to call it without a context.
        (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
        * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
        (_loop_vec_info): Initialize epilogue_vinfos.
        (~_loop_vec_info): Release epilogue_vinfos.
        (vect_analyze_loop_costing): Use knowledge of main VF to estimate
        number of iterations of epilogue.
        (vect_analyze_loop_2): Adapt to analyse main loop for all supported
        vector sizes when vect-epilogues-nomask=1. Also keep track of lowest
        versioning threshold needed for main loop.
        (vect_analyze_loop): Likewise.
        (find_in_mapping): New helper function.
        (update_epilogue_loop_vinfo): New function.
        (vect_transform_loop): When vectorizing epilogues re-use analysis done
        on main loop and call update_epilogue_loop_vinfo to update it.
        * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
        stmts on loop preheader edge.
        (vect_do_peeling): Enable skip-vectors when doing loop versioning if
        we decided to vectorize epilogues. Update epilogues NITERS and
        construct ADVANCE to update epilogues data references where needed.
        * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos,
        epilogue_vsizes and update_epilogue_vinfo members.
        (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS,
        LOOP_VINFO_EPILOGUE_SIZES): Define MACROs.
        (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs,
        determine_peel_for_niter, vect_analyze_loop): Add or update
        declarations.
        * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
        created loop_vec_info's for epilogues when available. Otherwise
        analyse epilogue separately.
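A standalone sketch of the simplify_replace_tree change described above, with std::string standing in for tree (none of this is the GCC API): the valueize callback gains an opaque context pointer so a caller can consult an arbitrary old-name to new-name mapping, which is what the new find_in_mapping helper does in the patch:

#include <cassert>
#include <string>
#include <unordered_map>

using name_map = std::unordered_map<std::string, std::string>;
using valueize_fn = std::string (*) (const std::string &, void *);

/* Stand-in for the substitution walk; the real code recurses over operands.  */
static std::string replace_name (const std::string &op,
                                 valueize_fn valueize, void *context)
{
  return valueize (op, context);
}

/* Counterpart of find_in_mapping: replace OP if it is a key in the mapping,
   otherwise return it unchanged.  */
static std::string find_in_map (const std::string &op, void *context)
{
  name_map *mapping = static_cast<name_map *> (context);
  auto it = mapping->find (op);
  return it != mapping->end () ? it->second : op;
}

int main ()
{
  name_map mapping = { { "x_1", "x_7" } };
  assert (replace_name ("x_1", find_in_map, &mapping) == "x_7");
  assert (replace_name ("y_2", find_in_map, &mapping) == "y_2");
  return 0;
}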
On Tue, 22 Oct 2019, Andre Vieira (lists) wrote: > Hi Richi, > > See inline responses to your comments. > > On 11/10/2019 13:57, Richard Biener wrote: > > On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi, > >> > > > > > + > > + /* Keep track of vector sizes we know we can vectorize the epilogue > > with. */ > > + vector_sizes epilogue_vsizes; > > }; > > > > please don't enlarge struct loop, instead track this somewhere > > in the vectorizer (in loop_vinfo? I see you already have > > epilogue_vinfos there - so the loop_vinfo simply lacks > > convenient access to the vector_size?) I don't see any > > use that could be trivially adjusted to look at a loop_vinfo > > member instead. > > Done. > > > > For the vect_update_inits_of_drs this means that we'd possibly > > do less CSE. Not sure if really an issue. > > CSE of what exactly? You are afraid we are repeating a calculation here we > have done elsewhere before? All uses of those inits now possibly get the expression instead of just the SSA name we inserted code for once. But as said, we'll see. > > > > You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes > > LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to > > LOOP_VINFO_EPILOGUE_P. > > I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I > then use the resulting loop info. If there are cases you feel strongly about > let me know. Not too strongly, no. > > > > @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > > niters, tree nitersm1, > > else > > niters_prolog = build_int_cst (type, 0); > > > > + loop_vec_info epilogue_vinfo = NULL; > > + if (vect_epilogues) > > + { > > ... > > + vect_epilogues = false; > > + } > > + > > > > I don't understand what all this does - it clearly needs a comment. > > Maybe the overall comment of the function should be amended with > > an overview of how we handle [multiple] epilogue loop vectorization? > > I added more comments both here and on top of the function. Hopefully it is a > bit clearer now, but it might need some tweaking. > > > > > + > > + if (epilogue_any_upper_bound && prolog_peeling >= 0) > > + { > > + epilog->any_upper_bound = true; > > + epilog->nb_iterations_upper_bound = eiters + 1; > > + } > > + > > > > comment missing. How can prolog_peeling be < 0? We likely > > didn't set the upper bound because we don't know it in the > > case we skipped the vector loop (skip_vector)? So make sure > > to not introduce wrong-code issues here - maybe do this > > optimization as followup?n > > > > So the reason for this code wasn't so much an optimization as it was for > correctness. But I was mistaken, the failure I was seeing without this code > was not because of this code, but rather being hidden by it. The problem I was > seeing was that a prolog was being created using the original loop copy, > rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in > the scalar prolog, leading to expand ICEs. I have fixed that issue by making > sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or > SCALAR loop for epilogue peeling depending on whether we will be vectorizing > said epilogue. OK. 
> > > @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info > > loop_vinfo) > > return 0; > > } > > > > - HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop); > > + HOST_WIDE_INT estimated_niter = -1; > > + > > + if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) > > + estimated_niter > > + = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1; > > + if (estimated_niter == -1) > > + estimated_niter = estimated_stmt_executions_int (loop); > > if (estimated_niter == -1) > > estimated_niter = likely_max_stmt_executions_int (loop); > > if (estimated_niter != -1 > > > > it's clearer if the old code is completely in a else {} path > > even though vect_vf_for_cost - 1 should never be -1. > > > Done for the == -1 cases, need to keep the != -1 outside of course. > > - if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) > > + if (LOOP_REQUIRES_VERSIONING (loop_vinfo) > > + || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) > > + && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))) > > > > not sure why we need to do this for epilouges? > > > > This is because we want to compute the versioning threshold for epilogues such > that we can use the minimum versioning threshold when versioning the main > loop. The reason we need to ask we need to ask the original main loop is > partially because of code in 'vect_analyze_data_ref_dependences' that chooses > to not do DR dependence analysis and thus never fills > LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence > LOOP_VINFO_COMP_ALIAS_DDRS is always 0. > > The piece of code is preceded by this comment: > /* For epilogues we either have no aliases or alias versioning > was applied to original loop. Therefore we may just get max_vf > using VF of original loop. */ > > I have added some comments to make it clearer. > > > > +static tree > > +replace_ops (tree op, hash_map<tree, tree> &mapping) > > +{ > > > > I'm quite sure I've seen such beast elsewhere ;) simplify_replace_tree > > comes up first (not a 1:1 match but hints at a possible tree > > sharing issue in your variant). > > > > The reason I couldn't use simplify_replace_tree is because I didn't know what > the "OLD" value is at the time I want to call it. Basically I want to check > whether an SSA name is a key in MAPPING and if so replace it with the > corresponding VALUE. > > I have changed simplify_replace_tree such that valueize can take a context > parameter. I replaced one use of replace_ops with it and the other I > specialized as I found that it was always a MEM_REF and we needed to replace > the address it was dereferencing. > > > > > + tree advance; > > epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, > > &niters_vector, > > &step_vector, &niters_vector_mult_vf, th, > > - check_profitability, niters_no_overflow); > > + check_profitability, niters_no_overflow, > > + &advance); > > + > > + if (epilogue) > > + { > > + basic_block *orig_bbs = get_loop_body (loop); > > + loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue); > > ... > > > > orig_stmts/drs/etc. in the epilogue loop_vinfo and ... > > > > + /* We are done vectorizing the main loop, so now we update the > > epilogues > > + stmt_vec_info's. At the same time we set the gimple UID of each > > + statement in the epilogue, as these are used to look them up in > > the > > + epilogues loop_vec_info later. We also keep track of what > > ... > > > > split this out to a new function. 
I wonder why you need to record > > the DRs, are they not available via ->datarefs and lookup_dr ()? > > lookup_dr may no longer work at this point. I found that for some memory > accesses by the time I got to this point, the DR_STMT of the data_reference > pointed to a scalar statement that no longer existed and the lookup_dr to that > data reference ICE's. I can't make this update before we transform the loop > because the data references are shared, so I decided to capture the > dr_vec_info's instead. Apparently we don't ever do a lookup_dr past this > point, which I must admit is surprising. OK, as long as this fixup code is well isolated we can see how to make it prettier later ;) But yes, we have some vectorizer transforms that remove old stmts (bad). At least that's true for stores, we could probably delay actual (scalar) stmt removal until the whole series of loop + epilogue vectorization is finished. As said, let's try as followup. > > Still have to go over the main loop doing the analysis/transform. > > > > Thanks, it looks really promising (albeit exepectedly ugly due to > > the data rewriting). > > > > Yeah, though I feel like now that I have put it away into functions it makes > it look cleaner. That vect_transform_loop function was getting too big! > > Is this OK for trunk? You probably no longer need the gentype.c hunk. +} + +static void +update_epilogue_loop_vinfo (class loop *epilogue, tree advance) function comment missing + + + /* We are done vectorizing the main loop, so now we update the epilogues too much vertical space. + /* We are done vectorizing the main loop, so now we update the epilogues + stmt_vec_info's. At the same time we set the gimple UID of each + statement in the epilogue, as these are used to look them up in the + epilogues loop_vec_info later. We also keep track of what + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might + need updating and we construct a mapping between variables defined in + the main loop and their corresponding names in epilogue. */ + for (unsigned i = 0; i < epilogue->num_nodes; ++i) so for the following code I wonder if you can make use of the fact that loop copying also copies UIDs, so you should be able to match stmts via their UIDs and get at the other loop infos stmt_info by the copy loop stmt UID. I wonder why you need no modification for the SLP tree? Otherwise the patch looks OK. Thanks, Richard. > gcc/ChangeLog: > 2019-10-22 Andre Vieira <andre.simoesdiasvieira@arm.com> > > PR 88915 > * gentype.c (main): Add poly_uint64 type and vector_sizes to > generator. > * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > and make the valueize function pointer also take a void pointer. > * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > around vn_valueize, to call it without a context. > (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > * tree-vect-loop.c (vect_get_loop_niters): Make externally visible. > (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (find_in_mapping): New helper function. 
> (update_epilogue_loop_vinfo): New function. > (vect_transform_loop): When vectorizing epilogues re-use analysis done > on main loop and call update_epilogue_loop_vinfo to update it. > * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > stmts on loop preheader edge. > (vect_do_peeling): Enable skip-vectors when doing loop versioning if > we decided to vectorize epilogues. Update epilogues NITERS and > construct ADVANCE to update epilogues data references where needed. > * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos, > epilogue_vsizes and update_epilogue_vinfo members. > (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS, > LOOP_VINFO_EPILOGUE_SIZES): Define MACROs. > (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs, > determine_peel_for_niter, vect_analyze_loop): Add or update declarations. > * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > created loop_vec_info's for epilogues when available. Otherwise > analyse > epilogue separately. >
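A standalone illustration (invented types, nothing GIMPLE-specific) of the UID-matching idea suggested above: if copying the loop preserves every statement's UID, the epilogue statement corresponding to an original statement can be found through a UID-to-statement table built from the copy, instead of walking the two loop bodies in lockstep:

#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct stmt { unsigned uid; std::string text; };

int main ()
{
  /* Original loop body and its copy; the copy keeps the UIDs.  */
  std::vector<stmt> original = { { 1, "a_1 = load" }, { 2, "b_1 = a_1 + c" } };
  std::vector<stmt> copy     = { { 1, "a_9 = load" }, { 2, "b_9 = a_9 + c" } };

  std::unordered_map<unsigned, const stmt *> by_uid;
  for (const stmt &s : copy)
    by_uid[s.uid] = &s;

  /* Look up the copy of each original statement by its UID.  */
  for (const stmt &s : original)
    assert (by_uid.at (s.uid)->uid == s.uid);

  assert (by_uid.at (2)->text == "b_9 = a_9 + c");
  return 0;
}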
Thanks for doing this. Hope this message doesn't cover too much old ground or duplicate too much... "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: > @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > else > niters_prolog = build_int_cst (type, 0); > > + loop_vec_info epilogue_vinfo = NULL; > + if (vect_epilogues) > + { > + /* Take the next epilogue_vinfo to vectorize for. */ > + epilogue_vinfo = loop_vinfo->epilogue_vinfos[0]; > + loop_vinfo->epilogue_vinfos.ordered_remove (0); > + > + /* Don't vectorize epilogues if this is not the most inner loop or if > + the epilogue may need peeling for alignment as the vectorizer doesn't > + know how to handle these situations properly yet. */ > + if (loop->inner != NULL > + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo)) > + vect_epilogues = false; > + > + } Nit: excess blank line before "}". Sorry if this was discussed before, but what's the reason for delaying the check for "loop->inner" to this point, rather than doing it in vect_analyze_loop? > + > + tree niters_vector_mult_vf; > + unsigned int lowest_vf = constant_lower_bound (vf); > + /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work > + on niters already ajusted for the iterations of the prologue. */ Pre-existing typo: adjusted. But... > + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > + && known_eq (vf, lowest_vf)) > + { > + loop_vec_info orig_loop_vinfo; > + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > + orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo); > + else > + orig_loop_vinfo = loop_vinfo; > + vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo); > + unsigned next_size = 0; > + unsigned HOST_WIDE_INT eiters > + = (LOOP_VINFO_INT_NITERS (loop_vinfo) > + - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); > + > + if (prolog_peeling > 0) > + eiters -= prolog_peeling; ...is that comment still true? We're now subtracting the peeling amount here. Might be worth asserting prolog_peeling >= 0, just to emphasise that we can't get here for variable peeling amounts, and then subtract prolog_peeling unconditionally (assuming that's the right thing to do). > + eiters > + = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); > + > + unsigned int ratio; > + while (next_size < vector_sizes.length () > + && !(constant_multiple_p (current_vector_size, > + vector_sizes[next_size], &ratio) > + && eiters >= lowest_vf / ratio)) > + next_size += 1; > + > + if (next_size == vector_sizes.length ()) > + vect_epilogues = false; > + } > + > /* Prolog loop may be skipped. */ > bool skip_prolog = (prolog_peeling != 0); > /* Skip to epilog if scalar loop may be preferred. It's only needed > - when we peel for epilog loop and when it hasn't been checked with > - loop versioning. */ > + when we peel for epilog loop or when we loop version. */ > bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), > bound_prolog + bound_epilog) > - : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); > + : (!LOOP_REQUIRES_VERSIONING (loop_vinfo) > + || vect_epilogues)); The comment update looks wrong here: without epilogues, we don't need the skip when loop versioning, because loop versioning ensures that we have at least one vector iteration. (I think "it" was supposed to mean "skipping to the epilogue" rather than the epilogue loop itself, in case that's the confusion.) It'd be good to mention the epilogue condition in the comment too. 
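A worked example of the remaining-iteration check quoted above (numbers invented, and the ratio handling is simplified to plain vectorization factors): with a compile-time-known trip count, compute how many scalar iterations the epilogue will actually see after prologue peeling and the main vector loop, and only keep epilogue vector sizes whose vectorization factor still fits:

#include <cstdio>
#include <vector>

int main ()
{
  unsigned long niters = 105, prolog_peeling = 3, peeling_for_gaps = 0;
  unsigned long main_vf = 16;   /* lowest_vf of the main loop */

  unsigned long eiters = niters - peeling_for_gaps - prolog_peeling;
  eiters = eiters % main_vf + peeling_for_gaps;   /* 102 % 16 = 6 iterations left */

  /* Candidate epilogue vectorization factors, larger first.  */
  std::vector<unsigned long> epilogue_vfs = { 8, 4, 2 };
  for (unsigned long vf : epilogue_vfs)
    printf ("epilogue VF %lu: %s\n", vf,
            eiters >= vf ? "worth entering" : "skipped");
  return 0;
}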
> @@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > > dump_user_location_t loop_loc = find_loop_location (loop); > class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); > + if (vect_epilogues) > + /* Make sure to set the epilogue's epilogue scalar loop, such that we can > + we can use the original scalar loop as remaining epilogue if > + necessary. */ Double "we can". > + LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo) > + = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); > + > if (prolog_peeling) > { > e = loop_preheader_edge (loop); > @@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > "loop can't be duplicated to exit edge.\n"); > gcc_unreachable (); > } > - /* Peel epilog and put it on exit edge of loop. */ > - epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e); > + /* Peel epilog and put it on exit edge of loop. If we are vectorizing > + said epilog then we should use a copy of the main loop as a starting > + point. This loop may have been already had some preliminary s/been// > + transformations to allow for more optimal vectorizationg, for example typo: vectorizationg > + if-conversion. If we are not vectorizing the epilog then we should > + use the scalar loop as the transformations mentioned above make less > + or no sense when not vectorizing. */ > + epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop; > + epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e); > if (!epilog) > { > dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc, > "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n"); > gcc_unreachable (); > } > + > epilog->force_vectorize = false; > slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false); > > [...] > @@ -2699,10 +2774,163 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > adjust_vec_debug_stmts (); > scev_reset (); > } > + > + if (vect_epilogues) > + { > + epilog->aux = epilogue_vinfo; > + LOOP_VINFO_LOOP (epilogue_vinfo) = epilog; > + > + loop_constraint_clear (epilog, LOOP_C_INFINITE); > + > + /* We now must calculate the number of iterations for our epilogue. */ > + tree cond_niters, niters; > + > + /* Depending on whether we peel for gaps we take niters or niters - 1, > + we will refer to this as N - G, where N and G are the NITERS and > + GAP for the original loop. */ > + niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > + ? LOOP_VINFO_NITERSM1 (loop_vinfo) > + : LOOP_VINFO_NITERS (loop_vinfo); > + > + /* Here we build a vector factorization mask: > + vf_mask = ~(VF - 1), where VF is the Vectorization Factor. */ > + tree vf_mask = build_int_cst (TREE_TYPE (niters), > + LOOP_VINFO_VECT_FACTOR (loop_vinfo)); > + vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask), > + vf_mask, > + build_one_cst (TREE_TYPE (vf_mask))); > + vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask); > + > + /* Here we calculate: > + niters = N - ((N-G) & ~(VF -1)) */ > + niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters), > + LOOP_VINFO_NITERS (loop_vinfo), > + fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters), > + niters, > + vf_mask)); Might be a daft question, sorry, but why does this need to be so complicated? Couldn't we just use the final value of the main loop's IV to calculate how many iterations are left? The current code wouldn't for example work for non-power-of-2 SVE vectors. 
vect_set_loop_condition_unmasked is structured to cope with that case (in length-agnostic mode only), even when an epilogue is needed. > [...] > - return epilog; > + if (vect_epilogues) > + { > + basic_block *bbs = get_loop_body (loop); > + loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilog); > + > + LOOP_VINFO_UP_STMTS (epilogue_vinfo).create (0); > + LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).create (0); > + LOOP_VINFO_UP_DRS (epilogue_vinfo).create (0); > + > + gimple_stmt_iterator gsi; > + gphi_iterator phi_gsi; > + gimple *stmt; > + stmt_vec_info stmt_vinfo; > + dr_vec_info *dr_vinfo; > + > + /* The stmt_vec_info's of the epilogue were constructed for the main loop > + and need to be updated to refer to the cloned variables used in the > + epilogue loop. We do this by assuming the original main loop and the > + epilogue loop are identical (aside the different SSA names). This > + means we assume we can go through each BB in the loop and each STMT in > + each BB and map them 1:1, replacing the STMT_VINFO_STMT of each > + stmt_vec_info in the epilogue's loop_vec_info. Here we only keep > + track of the original state of the main loop, before vectorization. > + After vectorization we proceed to update the epilogue's stmt_vec_infos > + information. We also update the references in PATTERN_DEF_SEQ's, > + RELATED_STMT's and data_references. Mainly the latter has to be > + updated after we are done vectorizing the main loop, as the > + data_references are shared between main and epilogue. */ > + for (unsigned i = 0; i < loop->num_nodes; ++i) > + { > + for (phi_gsi = gsi_start_phis (bbs[i]); > + !gsi_end_p (phi_gsi); gsi_next (&phi_gsi)) > + LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (phi_gsi.phi ()); > + for (gsi = gsi_start_bb (bbs[i]); > + !gsi_end_p (gsi); gsi_next (&gsi)) > + { > + stmt = gsi_stmt (gsi); > + LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (stmt); > + stmt_vinfo = epilogue_vinfo->lookup_stmt (stmt); Nit: double space before "=". > + if (stmt_vinfo != NULL > + && stmt_vinfo->dr_aux.stmt == stmt_vinfo) > + { > + dr_vinfo = STMT_VINFO_DR_INFO (stmt_vinfo); > + /* Data references pointing to gather loads and scatter stores > + require special treatment because the address computation > + happens in a different gimple node, pointed to by DR_REF. > + In contrast to normal loads and stores where we only need > + to update the offset of the data reference. */ > + if (STMT_VINFO_GATHER_SCATTER_P (dr_vinfo->stmt)) > + LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).safe_push (dr_vinfo); > + LOOP_VINFO_UP_DRS (epilogue_vinfo).safe_push (dr_vinfo); > + } > + } > + } > + } > + > + return vect_epilogues ? epilog : NULL; > } > > /* Function vect_create_cond_for_niters_checks. > [...] > @@ -2151,8 +2176,18 @@ start_over: > /* During peeling, we need to check if number of loop iterations is > enough for both peeled prolog loop and vector loop. This check > can be merged along with threshold check of loop versioning, so > - increase threshold for this case if necessary. */ > - if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) > + increase threshold for this case if necessary. > + > + If we are analyzing an epilogue we still want to check what it's s/it's/its/ > + versioning threshold would be. If we decide to vectorize the epilogues we > + will want to use the lowest versioning threshold of all epilogues and main > + loop. This will enable us to enter a vectorized epilogue even when > + versioning the loop. 
We can't simply check whether the epilogue requires > + versioning though since we may have skipped some versioning checks when > + analyzing the epilogue. For instance, checks for alias versioning will be Nit: should be two spaces after ".". > + skipped when dealing with epilogues as we assume we already checked them > + for the main loop. So instead we always check the 'orig_loop_vinfo'. */ > + if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)) > { > poly_uint64 niters_th = 0; > unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); > @@ -2307,14 +2342,8 @@ again: > be vectorized. */ > opt_loop_vec_info > vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, > - vec_info_shared *shared) > + vec_info_shared *shared, vector_sizes vector_sizes) > { > - auto_vector_sizes vector_sizes; > - > - /* Autodetect first vector size we try. */ > - current_vector_size = 0; > - targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, > - loop->simdlen != 0); > unsigned int next_size = 0; > > DUMP_VECT_SCOPE ("analyze_loop_nest"); > @@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, > poly_uint64 autodetected_vector_size = 0; > opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); > poly_uint64 first_vector_size = 0; > + poly_uint64 lowest_th = 0; > + unsigned vectorized_loops = 0; > + bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK); > while (1) > { > /* Check the CFG characteristics of the loop (nesting, entry/exit). */ > @@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, > > if (orig_loop_vinfo) > LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo; > + else if (vect_epilogues && first_loop_vinfo) > + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; > > opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts); > if (res) > { > LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1; > + vectorized_loops++; > > - if (loop->simdlen > - && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo), > - (unsigned HOST_WIDE_INT) loop->simdlen)) > + if ((loop->simdlen > + && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo), > + (unsigned HOST_WIDE_INT) loop->simdlen)) > + || vect_epilogues) > { > if (first_loop_vinfo == NULL) > { > first_loop_vinfo = loop_vinfo; > + lowest_th > + = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); > first_vector_size = current_vector_size; > loop->aux = NULL; > } > else > - delete loop_vinfo; > + { > + /* Keep track of vector sizes that we know we can vectorize > + the epilogue with. */ > + if (vect_epilogues) > + { > + loop->aux = NULL; > + first_loop_vinfo->epilogue_vsizes.reserve (1); > + first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size); > + first_loop_vinfo->epilogue_vinfos.reserve (1); > + first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo); I've messed you around, sorry, but the patches I committed this weekend mean we now store the vector size in the loop_vinfo. It'd be good to avoid a separate epilogue_vsizes array if possible. > + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; > + poly_uint64 th > + = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); > + gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo) > + || maybe_ne (lowest_th, 0U)); > + /* Keep track of the known smallest versioning > + threshold. 
*/ > + if (ordered_p (lowest_th, th)) > + lowest_th = ordered_min (lowest_th, th); > + } > + else > + delete loop_vinfo; > + } > } > else > { > @@ -2408,6 +2468,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, > dump_dec (MSG_NOTE, current_vector_size); > dump_printf (MSG_NOTE, "\n"); > } > + LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th; > + > return first_loop_vinfo; > } > else > @@ -8128,6 +8190,188 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, > *seen_store = stmt_info; > } > > +/* Helper function to pass to simplify_replace_tree to enable replacing tree's > + in the hash_map with its corresponding values. */ > +static tree > +find_in_mapping (tree t, void *context) > +{ > + hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context; > + > + tree *value = mapping->get (t); > + return value ? *value : t; > +} > + > +static void > +update_epilogue_loop_vinfo (class loop *epilogue, tree advance) > +{ > + loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue); > + auto_vec<stmt_vec_info> pattern_worklist, related_worklist; > + hash_map<tree,tree> mapping; > + gimple *orig_stmt, *new_stmt; > + gimple_stmt_iterator epilogue_gsi; > + gphi_iterator epilogue_phi_gsi; > + stmt_vec_info stmt_vinfo = NULL, related_vinfo; > + basic_block *epilogue_bbs = get_loop_body (epilogue); > + > + LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs; > + > + vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR); > + > + > + /* We are done vectorizing the main loop, so now we update the epilogues > + stmt_vec_info's. At the same time we set the gimple UID of each "epilogue's stmt_vec_infos" > + statement in the epilogue, as these are used to look them up in the > + epilogues loop_vec_info later. We also keep track of what epilogue's > + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might PATTERN_DEF_SEQs and RELATED_STMTs > + need updating and we construct a mapping between variables defined in > + the main loop and their corresponding names in epilogue. */ > + for (unsigned i = 0; i < epilogue->num_nodes; ++i) > + { > + for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]); > + !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi)) > + { > + orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0]; > + LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0); > + new_stmt = epilogue_phi_gsi.phi (); > + > + stmt_vinfo > + = epilogue_vinfo->lookup_stmt (orig_stmt); Nit: fits one line. > + > + STMT_VINFO_STMT (stmt_vinfo) = new_stmt; > + gimple_set_uid (new_stmt, gimple_uid (orig_stmt)); > + > + mapping.put (gimple_phi_result (orig_stmt), > + gimple_phi_result (new_stmt)); Nit: indented too far. > + > + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) > + pattern_worklist.safe_push (stmt_vinfo); > + > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > + while (related_vinfo && related_vinfo != stmt_vinfo) > + { > + related_worklist.safe_push (related_vinfo); > + /* Set BB such that the assert in > + 'get_initial_def_for_reduction' is able to determine that > + the BB of the related stmt is inside this loop. 
*/ > + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), > + gimple_bb (new_stmt)); > + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); > + } > + } > + > + for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]); > + !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi)) > + { > + orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0]; > + LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0); > + new_stmt = gsi_stmt (epilogue_gsi); > + > + stmt_vinfo > + = epilogue_vinfo->lookup_stmt (orig_stmt); Fits on one line. > + > + STMT_VINFO_STMT (stmt_vinfo) = new_stmt; > + gimple_set_uid (new_stmt, gimple_uid (orig_stmt)); > + > + if (is_gimple_assign (orig_stmt)) > + { > + gcc_assert (is_gimple_assign (new_stmt)); > + mapping.put (gimple_assign_lhs (orig_stmt), > + gimple_assign_lhs (new_stmt)); > + } Why just assigns? Don't we need to handle calls too? Maybe just use gimple_get_lhs here. > + > + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) > + pattern_worklist.safe_push (stmt_vinfo); > + > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > + while (related_vinfo && related_vinfo != stmt_vinfo) > + { > + related_worklist.safe_push (related_vinfo); > + /* Set BB such that the assert in > + 'get_initial_def_for_reduction' is able to determine that > + the BB of the related stmt is inside this loop. */ > + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), > + gimple_bb (new_stmt)); > + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); > + } > + } > + gcc_assert (LOOP_VINFO_UP_STMTS (epilogue_vinfo).length () == 0); > + } > + > + /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the PATTERN_DEF_SEQs > + original main loop and thus need to be updated to refer to the cloned > + variables used in the epilogue. */ > + for (unsigned i = 0; i < pattern_worklist.length (); ++i) > + { > + gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]); > + tree *new_op; > + > + while (seq) > + { > + for (unsigned j = 1; j < gimple_num_ops (seq); ++j) > + { > + tree op = gimple_op (seq, j); > + if ((new_op = mapping.get(op))) > + gimple_set_op (seq, j, *new_op); > + else > + { > + op = simplify_replace_tree (op, NULL_TREE, NULL_TREE, > + &find_in_mapping, &mapping); > + gimple_set_op (seq, j, op); > + } > + } > + seq = seq->next; > + } > + } > + > + /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be as above > + updated. */ > + for (unsigned i = 0; i < related_worklist.length (); ++i) > + { > + tree *new_t; > + gimple * stmt = STMT_VINFO_STMT (related_worklist[i]); > + for (unsigned j = 1; j < gimple_num_ops (stmt); ++j) > + if ((new_t = mapping.get(gimple_op (stmt, j)))) These days I think: if (tree *new_t = mapping.get(gimple_op (stmt, j))) is preferred. > + gimple_set_op (stmt, j, *new_t); > + } > + > + tree *new_op; > + /* Data references for gather loads and scatter stores do not use the > + updated offset we set using ADVANCE. Instead we have to make sure the > + reference in the data references point to the corresponding copy of > + the original in the epilogue. */ > + for (unsigned i = 0; i < LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).length (); ++i) > + { > + dr_vec_info *dr_vinfo = LOOP_VINFO_UP_GT_DRS (epilogue_vinfo)[i]; > + data_reference *dr = dr_vinfo->dr; > + gcc_assert (dr); > + gcc_assert (TREE_CODE (DR_REF (dr)) == MEM_REF); > + new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0)); > + > + if (new_op) Likewise: if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0))) here. 
> + { > + DR_REF (dr) = unshare_expr (DR_REF (dr)); > + TREE_OPERAND (DR_REF (dr), 0) = *new_op; > + DR_STMT (dr_vinfo->dr) = SSA_NAME_DEF_STMT (*new_op); > + } > + } > + > + /* The vector size of the epilogue is smaller than that of the main loop > + so the alignment is either the same or lower. This means the dr will > + thus by definition be aligned. */ > + for (unsigned i = 0; i < LOOP_VINFO_UP_DRS (epilogue_vinfo).length (); ++i) > + LOOP_VINFO_UP_DRS (epilogue_vinfo)[i]->base_misaligned = false; > + > + > + LOOP_VINFO_UP_STMTS (epilogue_vinfo).release (); > + LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).release (); > + LOOP_VINFO_UP_DRS (epilogue_vinfo).release (); > + > + epilogue_vinfo->shared->datarefs_copy.release (); > + epilogue_vinfo->shared->save_datarefs (); > +} > + > + > /* Function vect_transform_loop. > > The analysis phase has determined that the loop is vectorizable. > [...] > @@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, > LOCATION_FILE (vect_location.get_location_t ()), > LOCATION_LINE (vect_location.get_location_t ())); > > + /* If this is an epilogue, we already know what vector sizes we will use for > + vectorization as the analyzis was part of the main vectorized loop. Use > + these instead of going through all vector sizes again. */ > + if (orig_loop_vinfo > + && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ()) > + { > + vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo); > + assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo); > + current_vector_size = vector_sizes[0]; > + } > + else > + { > + /* Autodetect first vector size we try. */ > + current_vector_size = 0; > + > + targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes, > + loop->simdlen != 0); > + vector_sizes = auto_vector_sizes; > + } > + > /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p. */ > - opt_loop_vec_info loop_vinfo > - = vect_analyze_loop (loop, orig_loop_vinfo, &shared); > - loop->aux = loop_vinfo; > + opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL); > + if (loop_vec_info_for_loop (loop)) > + loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop)); > + else > + { > + loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes); > + loop->aux = loop_vinfo; > + } I don't really understand what this is doing for the epilogue case. Do we call vect_analyze_loop again? Are vector_sizes[1:] significant for epilogues? Thanks, Richard
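For readers following the find_in_mapping/simplify_replace_tree discussion above, the idea is simply: look a name up in the main-loop-to-epilogue mapping, fall back to the original if it is not there, and apply that to every leaf of an expression. Below is a minimal standalone C++ sketch of that pattern; the Expr/Mapping types and all names are illustrative only, not GCC's tree or hash_map API.

#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Expr {
  std::string name;                        // non-empty for leaves ("SSA names")
  std::vector<std::shared_ptr<Expr>> ops;  // operands of compound expressions
};

using Mapping = std::map<std::string, std::string>;

// Analogue of find_in_mapping: return the replacement if one exists,
// otherwise the original value unchanged.
static std::string find_in_mapping (const std::string &name, const Mapping &m)
{
  auto it = m.find (name);
  return it != m.end () ? it->second : name;
}

// Analogue of a valueize-driven replacement: rebuild the expression,
// replacing every leaf through the mapping.
static std::shared_ptr<Expr> replace_tree (const std::shared_ptr<Expr> &e,
                                           const Mapping &m)
{
  auto copy = std::make_shared<Expr> ();
  copy->name = e->ops.empty () ? find_in_mapping (e->name, m) : e->name;
  for (const auto &op : e->ops)
    copy->ops.push_back (replace_tree (op, m));
  return copy;
}

int main ()
{
  Mapping m = { { "x_1", "x_1_copy" }, { "sum_3", "sum_3_copy" } };
  auto leaf = std::make_shared<Expr> (Expr{ "x_1", {} });
  auto replaced = replace_tree (leaf, m);
  std::printf ("%s\n", replaced->name.c_str ());   // prints x_1_copy
  return 0;
}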
On 22/10/2019 14:56, Richard Biener wrote: > On Tue, 22 Oct 2019, Andre Vieira (lists) wrote: > >> Hi Richi, >> >> See inline responses to your comments. >> >> On 11/10/2019 13:57, Richard Biener wrote: >>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: >>> >>>> Hi, >>>> >> >>> >>> + >>> + /* Keep track of vector sizes we know we can vectorize the epilogue >>> with. */ >>> + vector_sizes epilogue_vsizes; >>> }; >>> >>> please don't enlarge struct loop, instead track this somewhere >>> in the vectorizer (in loop_vinfo? I see you already have >>> epilogue_vinfos there - so the loop_vinfo simply lacks >>> convenient access to the vector_size?) I don't see any >>> use that could be trivially adjusted to look at a loop_vinfo >>> member instead. >> >> Done. >>> >>> For the vect_update_inits_of_drs this means that we'd possibly >>> do less CSE. Not sure if really an issue. >> >> CSE of what exactly? You are afraid we are repeating a calculation here we >> have done elsewhere before? > > All uses of those inits now possibly get the expression instead of > just the SSA name we inserted code for once. But as said, we'll see. > This code changed after some comments from Richard Sandiford. > + /* We are done vectorizing the main loop, so now we update the > epilogues > + stmt_vec_info's. At the same time we set the gimple UID of each > + statement in the epilogue, as these are used to look them up in the > + epilogues loop_vec_info later. We also keep track of what > + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might > + need updating and we construct a mapping between variables defined > in > + the main loop and their corresponding names in epilogue. */ > + for (unsigned i = 0; i < epilogue->num_nodes; ++i) > > so for the following code I wonder if you can make use of the > fact that loop copying also copies UIDs, so you should be able > to match stmts via their UIDs and get at the other loop infos > stmt_info by the copy loop stmt UID. > > I wonder why you need no modification for the SLP tree? > I checked with Tamar and the SLP tree works with the position of operands and not SSA_NAMES. So we should be fine.
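The UID-matching idea mentioned here, which the later revisions rely on, can be pictured with a standalone sketch: loop copying preserves statement UIDs, so the infos recorded for the original statements can simply be re-pointed at the copied statements by UID. The types and names below are illustrative only, not GCC's gimple or stmt_vec_info API.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct stmt      { unsigned uid; std::string text; };
struct stmt_info { const stmt *current; bool is_pattern; };

int main ()
{
  // "Main loop" statements and their infos, indexed by UID.
  std::vector<stmt> orig = { { 1, "a_1 = b_2 + c_3" }, { 2, "d_4 = a_1 * 2" } };
  std::map<unsigned, stmt_info> infos;
  for (const stmt &s : orig)
    infos[s.uid] = { &s, false };

  // The epilogue copy keeps the same UIDs; only the defs/uses are renamed.
  std::vector<stmt> copy = { { 1, "a_9 = b_10 + c_11" }, { 2, "d_12 = a_9 * 2" } };

  // Re-point each info at the copied statement it now describes.
  for (const stmt &s : copy)
    infos[s.uid].current = &s;

  for (const auto &p : infos)
    std::printf ("uid %u -> %s\n", p.first, p.second.current->text.c_str ());
  return 0;
}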
On 22/10/2019 18:52, Richard Sandiford wrote: > Thanks for doing this. Hope this message doesn't cover too much old > ground or duplicate too much... > > "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: >> @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, >> else >> niters_prolog = build_int_cst (type, 0); >> >> + loop_vec_info epilogue_vinfo = NULL; >> + if (vect_epilogues) >> + { >> + /* Take the next epilogue_vinfo to vectorize for. */ >> + epilogue_vinfo = loop_vinfo->epilogue_vinfos[0]; >> + loop_vinfo->epilogue_vinfos.ordered_remove (0); >> + >> + /* Don't vectorize epilogues if this is not the most inner loop or if >> + the epilogue may need peeling for alignment as the vectorizer doesn't >> + know how to handle these situations properly yet. */ >> + if (loop->inner != NULL >> + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo)) >> + vect_epilogues = false; >> + >> + } > > Nit: excess blank line before "}". Sorry if this was discussed before, > but what's the reason for delaying the check for "loop->inner" to > this point, rather than doing it in vect_analyze_loop? Done. > >> + >> + tree niters_vector_mult_vf; >> + unsigned int lowest_vf = constant_lower_bound (vf); >> + /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work >> + on niters already ajusted for the iterations of the prologue. */ > > Pre-existing typo: adjusted. But... > >> + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) >> + && known_eq (vf, lowest_vf)) >> + { >> + loop_vec_info orig_loop_vinfo; >> + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) >> + orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo); >> + else >> + orig_loop_vinfo = loop_vinfo; >> + vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo); >> + unsigned next_size = 0; >> + unsigned HOST_WIDE_INT eiters >> + = (LOOP_VINFO_INT_NITERS (loop_vinfo) >> + - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); >> + >> + if (prolog_peeling > 0) >> + eiters -= prolog_peeling; > > ...is that comment still true? We're now subtracting the peeling > amount here. It is not, "adjusted" the comment ;) > Might be worth asserting prolog_peeling >= 0, just to emphasise > that we can't get here for variable peeling amounts, and then subtract > prolog_peeling unconditionally (assuming that's the right thing to do). > Can't assert as LOOP_VINFO_NITERS_KNOWN_P can be true even with prolog_peeling < 0, since we still know the constant number of scalar iterations, we just don't know how many vector iterations will be performed due to the runtime peeling. I will however, not reject vectorizing the epilogue, when we don't know how much we are peeling. >> + eiters >> + = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); >> + >> + unsigned int ratio; >> + while (next_size < vector_sizes.length () >> + && !(constant_multiple_p (current_vector_size, >> + vector_sizes[next_size], &ratio) >> + && eiters >= lowest_vf / ratio)) >> + next_size += 1; >> + >> + if (next_size == vector_sizes.length ()) >> + vect_epilogues = false; >> + } >> + >> /* Prolog loop may be skipped. */ >> bool skip_prolog = (prolog_peeling != 0); >> /* Skip to epilog if scalar loop may be preferred. It's only needed >> - when we peel for epilog loop and when it hasn't been checked with >> - loop versioning. */ >> + when we peel for epilog loop or when we loop version. */ >> bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) >> ? 
maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), >> bound_prolog + bound_epilog) >> - : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); >> + : (!LOOP_REQUIRES_VERSIONING (loop_vinfo) >> + || vect_epilogues)); > > The comment update looks wrong here: without epilogues, we don't need > the skip when loop versioning, because loop versioning ensures that we > have at least one vector iteration. > > (I think "it" was supposed to mean "skipping to the epilogue" rather > than the epilogue loop itself, in case that's the confusion.) > > It'd be good to mention the epilogue condition in the comment too. > Rewrote comment, hopefully this now better reflects reality. >> + >> + if (vect_epilogues) >> + { >> + epilog->aux = epilogue_vinfo; >> + LOOP_VINFO_LOOP (epilogue_vinfo) = epilog; >> + >> + loop_constraint_clear (epilog, LOOP_C_INFINITE); >> + >> + /* We now must calculate the number of iterations for our epilogue. */ >> + tree cond_niters, niters; >> + >> + /* Depending on whether we peel for gaps we take niters or niters - 1, >> + we will refer to this as N - G, where N and G are the NITERS and >> + GAP for the original loop. */ >> + niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) >> + ? LOOP_VINFO_NITERSM1 (loop_vinfo) >> + : LOOP_VINFO_NITERS (loop_vinfo); >> + >> + /* Here we build a vector factorization mask: >> + vf_mask = ~(VF - 1), where VF is the Vectorization Factor. */ >> + tree vf_mask = build_int_cst (TREE_TYPE (niters), >> + LOOP_VINFO_VECT_FACTOR (loop_vinfo)); >> + vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask), >> + vf_mask, >> + build_one_cst (TREE_TYPE (vf_mask))); >> + vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask); >> + >> + /* Here we calculate: >> + niters = N - ((N-G) & ~(VF -1)) */ >> + niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters), >> + LOOP_VINFO_NITERS (loop_vinfo), >> + fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters), >> + niters, >> + vf_mask)); > > Might be a daft question, sorry, but why does this need to be so > complicated? Couldn't we just use the final value of the main loop's > IV to calculate how many iterations are left? > > The current code wouldn't for example work for non-power-of-2 SVE vectors. > vect_set_loop_condition_unmasked is structured to cope with that case > (in length-agnostic mode only), even when an epilogue is needed. Good call, as we discussed I changed my approach here. Rather than using a conditional expression to guard against skipping the main loop, I now use a phi-node to carry the IV. This actually already exists, so I am duplicating here, but I didn't know what the best way was to "grab" this existing IV. >> + skipped when dealing with epilogues as we assume we already checked them >> + for the main loop. So instead we always check the 'orig_loop_vinfo'. */ >> + if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)) >> { >> poly_uint64 niters_th = 0; >> unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); >> @@ -2307,14 +2342,8 @@ again: >> be vectorized. */ >> opt_loop_vec_info >> vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, >> - vec_info_shared *shared) >> + vec_info_shared *shared, vector_sizes vector_sizes) >> { >> - auto_vector_sizes vector_sizes; >> - >> - /* Autodetect first vector size we try. 
*/ >> - current_vector_size = 0; >> - targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, >> - loop->simdlen != 0); >> unsigned int next_size = 0; >> >> DUMP_VECT_SCOPE ("analyze_loop_nest"); >> @@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, >> poly_uint64 autodetected_vector_size = 0; >> opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); >> poly_uint64 first_vector_size = 0; >> + poly_uint64 lowest_th = 0; >> + unsigned vectorized_loops = 0; >> + bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK); >> while (1) >> { >> /* Check the CFG characteristics of the loop (nesting, entry/exit). */ >> @@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, >> >> if (orig_loop_vinfo) >> LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo; >> + else if (vect_epilogues && first_loop_vinfo) >> + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; >> >> opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts); >> if (res) >> { >> LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1; >> + vectorized_loops++; >> >> - if (loop->simdlen >> - && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo), >> - (unsigned HOST_WIDE_INT) loop->simdlen)) >> + if ((loop->simdlen >> + && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo), >> + (unsigned HOST_WIDE_INT) loop->simdlen)) >> + || vect_epilogues) >> { >> if (first_loop_vinfo == NULL) >> { >> first_loop_vinfo = loop_vinfo; >> + lowest_th >> + = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); >> first_vector_size = current_vector_size; >> loop->aux = NULL; >> } >> else >> - delete loop_vinfo; >> + { >> + /* Keep track of vector sizes that we know we can vectorize >> + the epilogue with. */ >> + if (vect_epilogues) >> + { >> + loop->aux = NULL; >> + first_loop_vinfo->epilogue_vsizes.reserve (1); >> + first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size); >> + first_loop_vinfo->epilogue_vinfos.reserve (1); >> + first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo); > > I've messed you around, sorry, but the patches I committed this weekend > mean we now store the vector size in the loop_vinfo. It'd be good to > avoid a separate epilogue_vsizes array if possible. Rebased. Actually quite happy with that, makes for a cleaner patch on my end :) > >> + >> + stmt_vinfo >> + = epilogue_vinfo->lookup_stmt (orig_stmt); > > Nit: fits one line. > >> + >> + STMT_VINFO_STMT (stmt_vinfo) = new_stmt; >> + gimple_set_uid (new_stmt, gimple_uid (orig_stmt)); >> + >> + mapping.put (gimple_phi_result (orig_stmt), >> + gimple_phi_result (new_stmt)); > > Nit: indented too far. > >> + >> + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) >> + pattern_worklist.safe_push (stmt_vinfo); >> + >> + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); >> + while (related_vinfo && related_vinfo != stmt_vinfo) >> + { >> + related_worklist.safe_push (related_vinfo); >> + /* Set BB such that the assert in >> + 'get_initial_def_for_reduction' is able to determine that >> + the BB of the related stmt is inside this loop. 
*/ >> + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), >> + gimple_bb (new_stmt)); >> + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); >> + } >> + } >> + >> + for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]); >> + !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi)) >> + { >> + orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0]; >> + LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0); >> + new_stmt = gsi_stmt (epilogue_gsi); >> + >> + stmt_vinfo >> + = epilogue_vinfo->lookup_stmt (orig_stmt); > > Fits on one line. > >> + >> + STMT_VINFO_STMT (stmt_vinfo) = new_stmt; >> + gimple_set_uid (new_stmt, gimple_uid (orig_stmt)); >> + >> + if (is_gimple_assign (orig_stmt)) >> + { >> + gcc_assert (is_gimple_assign (new_stmt)); >> + mapping.put (gimple_assign_lhs (orig_stmt), >> + gimple_assign_lhs (new_stmt)); >> + } > > Why just assigns? Don't we need to handle calls too? > > Maybe just use gimple_get_lhs here. Changed. >> @@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, >> LOCATION_FILE (vect_location.get_location_t ()), >> LOCATION_LINE (vect_location.get_location_t ())); >> >> + /* If this is an epilogue, we already know what vector sizes we will use for >> + vectorization as the analyzis was part of the main vectorized loop. Use >> + these instead of going through all vector sizes again. */ >> + if (orig_loop_vinfo >> + && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ()) >> + { >> + vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo); >> + assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo); >> + current_vector_size = vector_sizes[0]; >> + } >> + else >> + { >> + /* Autodetect first vector size we try. */ >> + current_vector_size = 0; >> + >> + targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes, >> + loop->simdlen != 0); >> + vector_sizes = auto_vector_sizes; >> + } >> + >> /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p. */ >> - opt_loop_vec_info loop_vinfo >> - = vect_analyze_loop (loop, orig_loop_vinfo, &shared); >> - loop->aux = loop_vinfo; >> + opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL); >> + if (loop_vec_info_for_loop (loop)) >> + loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop)); >> + else >> + { >> + loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes); >> + loop->aux = loop_vinfo; >> + } > > I don't really understand what this is doing for the epilogue case. > Do we call vect_analyze_loop again? Are vector_sizes[1:] significant > for epilogues? The vector sizes code here is no longer needed after your patch. The loop_vec_info is just checking whether loop already has one set (which is the case for epilogues) and use that, or if not then analyse it (which is the case for the first vectorization). I'll add some comments. > > Thanks, > Richard >
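The 'eiters' check being discussed boils down to: with the scalar iteration count known, compute how many iterations are left after the main vector loop and only keep an epilogue whose VF still fits. A simplified sketch with plain integers follows; the real code uses poly_uint64, constant_multiple_p and the target's vector_sizes list, and the numbers below are made up.

#include <cstdio>
#include <vector>

int main ()
{
  unsigned niters = 77;          // known scalar iteration count
  unsigned gap = 0;              // extra iteration if peeling for gaps
  unsigned prolog_peeling = 3;   // constant prologue peeling, if any
  unsigned main_vf = 16;         // VF of the already-vectorized main loop

  // Iterations left over for the epilogue.
  unsigned eiters = niters - gap;
  if (prolog_peeling > 0)
    eiters -= prolog_peeling;
  eiters = eiters % main_vf + gap;

  // Candidate epilogue VFs, largest first (mirrors the vector_sizes order).
  std::vector<unsigned> candidate_vfs = { 8, 4 };

  unsigned chosen = 0;
  for (unsigned vf : candidate_vfs)
    if (eiters >= vf)            // enough work to enter this epilogue
      {
        chosen = vf;
        break;
      }

  if (chosen)
    std::printf ("%u leftover iterations: vectorize epilogue with VF %u\n",
                 eiters, chosen);
  else
    std::printf ("%u leftover iterations: leave the epilogue scalar\n", eiters);
  return 0;
}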
Hi,

This is the reworked patch after your comments.

I have moved the epilogue check into the analysis phase, disguised under '!epilogue_vinfos.is_empty ()'. This is because I realized that I am doing the "lowest threshold" check there.

The only place where we may reject an epilogue_vinfo is when we know the number of scalar iterations and we realize the number of iterations left after the main loop are not enough to enter the vectorized epilogue, so we optimize away that code-gen. The only way we know this to be true is if the number of scalar iterations is known and the peeling for alignment is known. So we know we will enter the main loop regardless, so whether the threshold we use is for a lower VF or not shouldn't matter as much. I would even like to think that check isn't done, but I am not sure... Might be worth checking as an optimization.

Is this OK for trunk?

gcc/ChangeLog:
2019-10-25  Andre Vieira  <andre.simoesdiasvieira@arm.com>

	PR 88915
	* tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
	* tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
	and make the valueize function pointer also take a void pointer.
	* gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
	around vn_valueize, to call it without a context.
	(process_bb): Use vn_valueize_wrapper instead of vn_valueize.
	* tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
	(~_loop_vec_info): Release epilogue_vinfos.
	(vect_analyze_loop_costing): Use knowledge of main VF to estimate
	number of iterations of epilogue.
	(vect_analyze_loop_2): Adapt to analyse main loop for all supported
	vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
	versioning threshold needed for main loop.
	(vect_analyze_loop): Likewise.
	(find_in_mapping): New helper function.
	(update_epilogue_loop_vinfo): New function.
	(vect_transform_loop): When vectorizing epilogues re-use analysis done
	on main loop and call update_epilogue_loop_vinfo to update it.
	* tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
	stmts on loop preheader edge.
	(vect_do_peeling): Enable skip-vectors when doing loop versioning if
	we decided to vectorize epilogues.  Update epilogues NITERS and
	construct ADVANCE to update epilogues data references where needed.
	* tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
	(vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter,
	vect_analyze_loop): Add or update declarations.
	* tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
	created loop_vec_info's for epilogues when available.  Otherwise
	analyse epilogue separately.

Cheers,
Andre
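The "lowest threshold" bookkeeping mentioned above amounts to taking the minimum of the versioning thresholds computed for the main loop and for each epilogue candidate, and emitting the main loop's versioning check with that value. A simplified illustration with plain unsigned values follows; the real code uses poly_uint64 with ordered_p/ordered_min, and the numbers are made up.

#include <algorithm>
#include <cstdio>
#include <vector>

int main ()
{
  // Versioning threshold computed by the analysis of each vector size,
  // main loop first, then the epilogue candidates.
  std::vector<unsigned> thresholds = { 17, 9, 5 };

  unsigned lowest_th = thresholds.front ();
  for (unsigned th : thresholds)
    lowest_th = std::min (lowest_th, th);   // ordered_min in the real code

  // The versioning check of the main loop is emitted with the lowest
  // threshold so that none of the vectorized epilogues is ruled out.
  std::printf ("versioning threshold used: %u\n", lowest_th);
  return 0;
}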
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > > > On 22/10/2019 14:56, Richard Biener wrote: > > On Tue, 22 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi Richi, > >> > >> See inline responses to your comments. > >> > >> On 11/10/2019 13:57, Richard Biener wrote: > >>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > >>> > >>>> Hi, > >>>> > >> > >>> > >>> + > >>> + /* Keep track of vector sizes we know we can vectorize the epilogue > >>> with. */ > >>> + vector_sizes epilogue_vsizes; > >>> }; > >>> > >>> please don't enlarge struct loop, instead track this somewhere > >>> in the vectorizer (in loop_vinfo? I see you already have > >>> epilogue_vinfos there - so the loop_vinfo simply lacks > >>> convenient access to the vector_size?) I don't see any > >>> use that could be trivially adjusted to look at a loop_vinfo > >>> member instead. > >> > >> Done. > >>> > >>> For the vect_update_inits_of_drs this means that we'd possibly > >>> do less CSE. Not sure if really an issue. > >> > >> CSE of what exactly? You are afraid we are repeating a calculation here we > >> have done elsewhere before? > > > > All uses of those inits now possibly get the expression instead of > > just the SSA name we inserted code for once. But as said, we'll see. > > > > This code changed after some comments from Richard Sandiford. > > > + /* We are done vectorizing the main loop, so now we update the > > epilogues > > + stmt_vec_info's. At the same time we set the gimple UID of each > > + statement in the epilogue, as these are used to look them up in the > > + epilogues loop_vec_info later. We also keep track of what > > + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might > > + need updating and we construct a mapping between variables defined > > in > > + the main loop and their corresponding names in epilogue. */ > > + for (unsigned i = 0; i < epilogue->num_nodes; ++i) > > > > so for the following code I wonder if you can make use of the > > fact that loop copying also copies UIDs, so you should be able > > to match stmts via their UIDs and get at the other loop infos > > stmt_info by the copy loop stmt UID. > > > > I wonder why you need no modification for the SLP tree? > > > I checked with Tamar and the SLP tree works with the position of operands and > not SSA_NAMES. So we should be fine. There's now SLP_TREE_SCALAR_OPS but only for invariants so I guess we should indeed be fine here. Everything else is already stmt_infos which you patch with the new underlying stmts. Richard.
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > Hi, > > This is the reworked patch after your comments. > > I have moved the epilogue check into the analysis form disguised under > '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing the > "lowest threshold" check there. > > The only place where we may reject an epilogue_vinfo is when we know the > number of scalar iterations and we realize the number of iterations left after > the main loop are not enough to enter the vectorized epilogue so we optimize > away that code-gen. The only way we know this to be true is if the number of > scalar iterations are known and the peeling for alignment is known. So we know > we will enter the main loop regardless, so whether the threshold we use is for > a lower VF or not it shouldn't matter as much, I would even like to think that > check isn't done, but I am not sure... Might be worth checking as an > optimization. > > > Is this OK for trunk? + for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]); + !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi)) + { .. + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) + pattern_worklist.safe_push (stmt_vinfo); + + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); + while (related_vinfo && related_vinfo != stmt_vinfo) + { I think PHIs cannot have patterns. You can assert that STMT_VINFO_RELATED_STMT is NULL I think. + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); + while (related_vinfo && related_vinfo != stmt_vinfo) + { + related_worklist.safe_push (related_vinfo); + /* Set BB such that the assert in + 'get_initial_def_for_reduction' is able to determine that + the BB of the related stmt is inside this loop. */ + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), + gimple_bb (new_stmt)); + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); + } do we really keep references to "nested" patterns? Thus, do you need this loop? + /* The PATTERN_DEF_SEQs in the epilogue were constructed using the + original main loop and thus need to be updated to refer to the cloned + variables used in the epilogue. */ + for (unsigned i = 0; i < pattern_worklist.length (); ++i) + { ... + op = simplify_replace_tree (op, NULL_TREE, NULL_TREE, + &find_in_mapping, &mapping); + gimple_set_op (seq, j, op); you do this for the pattern-def seq but not for the related one. I guess you ran into this for COND_EXPR conditions. I wondered to use a shared worklist for both the def-seq and the main pattern stmt or at least to split out the replacement so you can share it. + /* Data references for gather loads and scatter stores do not use the + updated offset we set using ADVANCE. Instead we have to make sure the + reference in the data references point to the corresponding copy of + the original in the epilogue. */ + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) + { + int j; + if (TREE_CODE (DR_REF (dr)) == MEM_REF) + j = 0; + else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF) + j = 1; + else + gcc_unreachable (); + + if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j))) + { + DR_REF (dr) = unshare_expr (DR_REF (dr)); + TREE_OPERAND (DR_REF (dr), j) = *new_op; + } huh, do you really only ever see MEM_REF or ARRAY_REF here? I would guess using simplify_replace_tree is safer. There's also DR_BASE_ADDRESS - we seem to leave the DRs partially updated, is that correct? Otherwise looks OK to me. Thanks, Richard. 
> gcc/ChangeLog: > 2019-10-25 Andre Vieira <andre.simoesdiasvieira@arm.com> > > PR 88915 > * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > and make the valueize function pointer also take a void pointer. > * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > around vn_valueize, to call it without a context. > (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (find_in_mapping): New helper function. > (update_epilogue_loop_vinfo): New function. > (vect_transform_loop): When vectorizing epilogues re-use analysis done > on main loop and call update_epilogue_loop_vinfo to update it. > * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > stmts on loop preheader edge. > (vect_do_peeling): Enable skip-vectors when doing loop versioning if > we decided to vectorize epilogues. Update epilogues NITERS and > construct ADVANCE to update epilogues data references where needed. > * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. > (vect_do_peeling, vect_update_inits_of_drs, > determine_peel_for_niter, vect_analyze_loop): Add or update declarations. > * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > created loop_vec_info's for epilogues when available. Otherwise > analyse > epilogue separately. > > > > Cheers, > Andre >
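The shared-worklist suggestion above is just: push the statements coming from PATTERN_DEF_SEQs and the related pattern statements onto a single list and run one operand-replacement pass over it. A standalone sketch with illustrative types follows (not GCC's stmt_vec_info or gimple API).

#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct stmt { std::string lhs; std::vector<std::string> ops; };

int main ()
{
  std::map<std::string, std::string> mapping
    = { { "a_1", "a_7" }, { "b_2", "b_8" } };

  // One worklist for everything that needs its operands remapped.
  std::vector<stmt *> worklist;
  stmt def_seq_stmt = { "patt_3", { "a_1", "b_2" } };    // from a def-seq
  stmt pattern_stmt = { "patt_4", { "patt_3", "a_1" } }; // main pattern stmt
  worklist.push_back (&def_seq_stmt);
  worklist.push_back (&pattern_stmt);

  // Single replacement pass shared by both kinds of statements.
  for (stmt *s : worklist)
    for (std::string &op : s->ops)
      {
        auto it = mapping.find (op);
        if (it != mapping.end ())
          op = it->second;
      }

  for (stmt *s : worklist)
    std::printf ("%s uses %s, %s\n", s->lhs.c_str (),
                 s->ops[0].c_str (), s->ops[1].c_str ());
  return 0;
}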
Hi, Reworked according to your comments, see inline for clarification. Is this OK for trunk? gcc/ChangeLog: 2019-10-28 Andre Vieira <andre.simoesdiasvieira@arm.com> PR 88915 * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter and make the valueize function pointer also take a void pointer. * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap around vn_valueize, to call it without a context. (process_bb): Use vn_valueize_wrapper instead of vn_valueize. * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (find_in_mapping): New helper function. (update_epilogue_loop_vinfo): New function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and call update_epilogue_loop_vinfo to update it. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. (vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already created loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Cheers, Andre On 28/10/2019 14:16, Richard Biener wrote: > On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > >> Hi, >> >> This is the reworked patch after your comments. >> >> I have moved the epilogue check into the analysis form disguised under >> '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing the >> "lowest threshold" check there. >> >> The only place where we may reject an epilogue_vinfo is when we know the >> number of scalar iterations and we realize the number of iterations left after >> the main loop are not enough to enter the vectorized epilogue so we optimize >> away that code-gen. The only way we know this to be true is if the number of >> scalar iterations are known and the peeling for alignment is known. So we know >> we will enter the main loop regardless, so whether the threshold we use is for >> a lower VF or not it shouldn't matter as much, I would even like to think that >> check isn't done, but I am not sure... Might be worth checking as an >> optimization. >> >> >> Is this OK for trunk? > > + for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]); > + !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi)) > + { > .. > + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) > + pattern_worklist.safe_push (stmt_vinfo); > + > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > + while (related_vinfo && related_vinfo != stmt_vinfo) > + { > > I think PHIs cannot have patterns. You can assert > that STMT_VINFO_RELATED_STMT is NULL I think. Done. 
> > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > + while (related_vinfo && related_vinfo != stmt_vinfo) > + { > + related_worklist.safe_push (related_vinfo); > + /* Set BB such that the assert in > + 'get_initial_def_for_reduction' is able to determine that > + the BB of the related stmt is inside this loop. */ > + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), > + gimple_bb (new_stmt)); > + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); > + } > > do we really keep references to "nested" patterns? Thus, do you > need this loop? Changed and added asserts. They didn't trigger so I suppose you are right, I didn't know at the time whether it was possible, so I just operated on the side of caution. Can remove the asserts and so on if you want. > > + /* The PATTERN_DEF_SEQs in the epilogue were constructed using the > + original main loop and thus need to be updated to refer to the > cloned > + variables used in the epilogue. */ > + for (unsigned i = 0; i < pattern_worklist.length (); ++i) > + { > ... > + op = simplify_replace_tree (op, NULL_TREE, NULL_TREE, > + &find_in_mapping, &mapping); > + gimple_set_op (seq, j, op); > > you do this for the pattern-def seq but not for the related one. > I guess you ran into this for COND_EXPR conditions. I wondered > to use a shared worklist for both the def-seq and the main pattern > stmt or at least to split out the replacement so you can share it. I think that was it yeah, reworked it now to use the same list. Less code, thanks! > > + /* Data references for gather loads and scatter stores do not use > the > + updated offset we set using ADVANCE. Instead we have to make > sure the > + reference in the data references point to the corresponding copy > of > + the original in the epilogue. */ > + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) > + { > + int j; > + if (TREE_CODE (DR_REF (dr)) == MEM_REF) > + j = 0; > + else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF) > + j = 1; > + else > + gcc_unreachable (); > + > + if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j))) > + { > + DR_REF (dr) = unshare_expr (DR_REF (dr)); > + TREE_OPERAND (DR_REF (dr), j) = *new_op; > + } > > huh, do you really only ever see MEM_REF or ARRAY_REF here? > I would guess using simplify_replace_tree is safer. > There's also DR_BASE_ADDRESS - we seem to leave the DRs partially > updated, is that correct? Yeah can use simplify_replace_tree indeed. And I have changed it so it updates DR_BASE_ADDRESS. I think DR_BASE_ADDRESS never actually changed in the way we use data_references... Either way, replacing them if they do change is cleaner and more future proof. > > Otherwise looks OK to me. > > Thanks, > Richard. > > >> gcc/ChangeLog: >> 2019-10-25 Andre Vieira <andre.simoesdiasvieira@arm.com> >> >> PR 88915 >> * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. >> * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter >> and make the valueize function pointer also take a void pointer. >> * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap >> around vn_valueize, to call it without a context. >> (process_bb): Use vn_valueize_wrapper instead of vn_valueize. >> * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. >> (~_loop_vec_info): Release epilogue_vinfos. >> (vect_analyze_loop_costing): Use knowledge of main VF to estimate >> number of iterations of epilogue. >> (vect_analyze_loop_2): Adapt to analyse main loop for all supported >> vector sizes when vect-epilogues-nomask=1. 
Also keep track of lowest >> versioning threshold needed for main loop. >> (vect_analyze_loop): Likewise. >> (find_in_mapping): New helper function. >> (update_epilogue_loop_vinfo): New function. >> (vect_transform_loop): When vectorizing epilogues re-use analysis done >> on main loop and call update_epilogue_loop_vinfo to update it. >> * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert >> stmts on loop preheader edge. >> (vect_do_peeling): Enable skip-vectors when doing loop versioning if >> we decided to vectorize epilogues. Update epilogues NITERS and >> construct ADVANCE to update epilogues data references where needed. >> * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. >> (vect_do_peeling, vect_update_inits_of_drs, >> determine_peel_for_niter, vect_analyze_loop): Add or update declarations. >> * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already >> created loop_vec_info's for epilogues when available. Otherwise >> analyse >> epilogue separately. >> >> >> >> Cheers, >> Andre >> >
On Mon, 28 Oct 2019, Andre Vieira (lists) wrote: > Hi, > > Reworked according to your comments, see inline for clarification. > > Is this OK for trunk? + gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo); + while (seq) + { + stmt_worklist.safe_push (seq); + seq = seq->next; + } you're supposed to do to the following, not access the ->next implementation detail: for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi); gsi_next (&gsi)) stmt_worklist.safe_push (gsi_stmt (gsi)); + /* Data references for gather loads and scatter stores do not use the + updated offset we set using ADVANCE. Instead we have to make sure the + reference in the data references point to the corresponding copy of + the original in the epilogue. */ + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) + { + DR_REF (dr) + = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE, + &find_in_mapping, &mapping); + DR_BASE_ADDRESS (dr) + = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE, NULL_TREE, + &find_in_mapping, &mapping); + } Hmm. So for other DRs we account for the previous vector loop by adjusting DR_OFFSET? But STMT_VINFO_GATHER_SCATTER_P ends up using (unconditionally) DR_REF here? In that case it seems best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and DR_OFFSET? I wonder how prologue peeling deals with STMT_VINFO_GATHER_SCATTER_P ... I see the caller of vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P. I wonder if (as followup to not delay this further) we can "offload" all the DR adjustment by storing ADVANCE in dr_vec_info and accounting for that when we create the dataref pointers in vectorizable_load/store? That way we could avoid saving/restoring DR_OFFSET as well. So, the patch is OK with the sequence iteration fixed. I think sorting out the above can be done as followup. Thanks, Richard. > gcc/ChangeLog: > 2019-10-28 Andre Vieira <andre.simoesdiasvieira@arm.com> > > PR 88915 > * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > and make the valueize function pointer also take a void pointer. > * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > around vn_valueize, to call it without a context. > (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (find_in_mapping): New helper function. > (update_epilogue_loop_vinfo): New function. > (vect_transform_loop): When vectorizing epilogues re-use analysis done > on main loop and call update_epilogue_loop_vinfo to update it. > * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > stmts on loop preheader edge. > (vect_do_peeling): Enable skip-vectors when doing loop versioning if > we decided to vectorize epilogues. Update epilogues NITERS and > construct ADVANCE to update epilogues data references where needed. > * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. > (vect_do_peeling, vect_update_inits_of_drs, > determine_peel_for_niter, vect_analyze_loop): Add or update declarations. 
> * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > created loop_vec_info's for epilogues when available. Otherwise > analyse > epilogue separately. > > > > Cheers, > Andre > > On 28/10/2019 14:16, Richard Biener wrote: > > On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi, > >> > >> This is the reworked patch after your comments. > >> > >> I have moved the epilogue check into the analysis form disguised under > >> '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing > >> the > >> "lowest threshold" check there. > >> > >> The only place where we may reject an epilogue_vinfo is when we know the > >> number of scalar iterations and we realize the number of iterations left > >> after > >> the main loop are not enough to enter the vectorized epilogue so we > >> optimize > >> away that code-gen. The only way we know this to be true is if the number > >> of > >> scalar iterations are known and the peeling for alignment is known. So we > >> know > >> we will enter the main loop regardless, so whether the threshold we use is > >> for > >> a lower VF or not it shouldn't matter as much, I would even like to think > >> that > >> check isn't done, but I am not sure... Might be worth checking as an > >> optimization. > >> > >> > >> Is this OK for trunk? > > > > + for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]); > > + !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi)) > > + { > > .. > > + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) > > + pattern_worklist.safe_push (stmt_vinfo); > > + > > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > > + while (related_vinfo && related_vinfo != stmt_vinfo) > > + { > > > > I think PHIs cannot have patterns. You can assert > > that STMT_VINFO_RELATED_STMT is NULL I think. > > Done. > > > > + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); > > + while (related_vinfo && related_vinfo != stmt_vinfo) > > + { > > + related_worklist.safe_push (related_vinfo); > > + /* Set BB such that the assert in > > + 'get_initial_def_for_reduction' is able to determine that > > + the BB of the related stmt is inside this loop. */ > > + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), > > + gimple_bb (new_stmt)); > > + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); > > + } > > > > do we really keep references to "nested" patterns? Thus, do you > > need this loop? > > Changed and added asserts. They didn't trigger so I suppose you are right, I > didn't know at the time whether it was possible, so I just operated on the > side of caution. Can remove the asserts and so on if you want. > > > > + /* The PATTERN_DEF_SEQs in the epilogue were constructed using the > > + original main loop and thus need to be updated to refer to the > > cloned > > + variables used in the epilogue. */ > > + for (unsigned i = 0; i < pattern_worklist.length (); ++i) > > + { > > ... > > + op = simplify_replace_tree (op, NULL_TREE, NULL_TREE, > > + &find_in_mapping, &mapping); > > + gimple_set_op (seq, j, op); > > > > you do this for the pattern-def seq but not for the related one. > > I guess you ran into this for COND_EXPR conditions. I wondered > > to use a shared worklist for both the def-seq and the main pattern > > stmt or at least to split out the replacement so you can share it. > > I think that was it yeah, reworked it now to use the same list. Less code, > thanks! > > > > + /* Data references for gather loads and scatter stores do not use > > the > > + updated offset we set using ADVANCE. 
Instead we have to make > > sure the > > + reference in the data references point to the corresponding copy > > of > > + the original in the epilogue. */ > > + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) > > + { > > + int j; > > + if (TREE_CODE (DR_REF (dr)) == MEM_REF) > > + j = 0; > > + else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF) > > + j = 1; > > + else > > + gcc_unreachable (); > > + > > + if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j))) > > + { > > + DR_REF (dr) = unshare_expr (DR_REF (dr)); > > + TREE_OPERAND (DR_REF (dr), j) = *new_op; > > + } > > > > huh, do you really only ever see MEM_REF or ARRAY_REF here? > > I would guess using simplify_replace_tree is safer. > > There's also DR_BASE_ADDRESS - we seem to leave the DRs partially > > updated, is that correct? > > Yeah can use simplify_replace_tree indeed. And I have changed it so it > updates DR_BASE_ADDRESS. I think DR_BASE_ADDRESS never actually changed in > the way we use data_references... Either way, replacing them if they do change > is cleaner and more future proof. > > > > Otherwise looks OK to me. > > > > Thanks, > > Richard. > > > > > >> gcc/ChangeLog: > >> 2019-10-25 Andre Vieira <andre.simoesdiasvieira@arm.com> > >> > >> PR 88915 > >> * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > >> * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > >> and make the valueize function pointer also take a void pointer. > >> * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > >> around vn_valueize, to call it without a context. > >> (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > >> * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. > >> (~_loop_vec_info): Release epilogue_vinfos. > >> (vect_analyze_loop_costing): Use knowledge of main VF to estimate > >> number of iterations of epilogue. > >> (vect_analyze_loop_2): Adapt to analyse main loop for all supported > >> vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > >> versioning threshold needed for main loop. > >> (vect_analyze_loop): Likewise. > >> (find_in_mapping): New helper function. > >> (update_epilogue_loop_vinfo): New function. > >> (vect_transform_loop): When vectorizing epilogues re-use analysis done > >> on main loop and call update_epilogue_loop_vinfo to update it. > >> * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > >> stmts on loop preheader edge. > >> (vect_do_peeling): Enable skip-vectors when doing loop versioning if > >> we decided to vectorize epilogues. Update epilogues NITERS and > >> construct ADVANCE to update epilogues data references where needed. > >> * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. > >> (vect_do_peeling, vect_update_inits_of_drs, > >> determine_peel_for_niter, vect_analyze_loop): Add or update > >> declarations. > >> * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > >> created loop_vec_info's for epilogues when available. Otherwise > >> analyse > >> epilogue separately. > >> > >> > >> > >> Cheers, > >> Andre > >> > > > >
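For context on the ADVANCE value mentioned in the ChangeLog: the epilogue's data references must start where the main vector loop stopped, i.e. the original init offset advanced by the iterations already executed times the access step. A toy illustration with made-up numbers (not GCC code):

#include <cstdio>

int main ()
{
  long init = 0;        // initial byte offset of the access
  long step = 4;        // bytes per scalar iteration (e.g. an int array)
  long advance = 64;    // scalar iterations consumed by the main vector loop

  long epilogue_init = init + advance * step;   // init used by the epilogue
  std::printf ("epilogue starts at byte offset %ld\n", epilogue_init);
  return 0;
}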
diff --git a/gcc/gengtype.c b/gcc/gengtype.c index 53317337cf8c8e8caefd6b819d28b3bba301e755..56ffa08a7dee54837441f0c743f8c0faa285c74b 100644 --- a/gcc/gengtype.c +++ b/gcc/gengtype.c @@ -5197,6 +5197,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("widest_int", &pos)); POS_HERE (do_scalar_typedef ("int64_t", &pos)); POS_HERE (do_scalar_typedef ("poly_int64", &pos)); + POS_HERE (do_scalar_typedef ("poly_uint64", &pos)); POS_HERE (do_scalar_typedef ("uint64_t", &pos)); POS_HERE (do_scalar_typedef ("uint8", &pos)); POS_HERE (do_scalar_typedef ("uintptr_t", &pos)); diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 5c25441c70a271f04730486e513437fffa75b7e3..3b5f14c45b5b9b601120c6776734bbafefe1e178 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2401,7 +2401,8 @@ class loop * vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, tree *niters_vector, tree *step_vector, tree *niters_vector_mult_vf_var, int th, - bool check_profitability, bool niters_no_overflow) + bool check_profitability, bool niters_no_overflow, + bool vect_epilogues_nomask) { edge e, guard_e; tree type = TREE_TYPE (niters), guard_cond; @@ -2474,7 +2475,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), bound_prolog + bound_epilog) - : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); + : (!LOOP_REQUIRES_VERSIONING (loop_vinfo) + || vect_epilogues_nomask)); /* Epilog loop must be executed if the number of iterations for epilog loop is known at compile time, otherwise we need to add a check at the end of vector loop and skip to the end of epilog loop. */ @@ -2966,9 +2968,7 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr) *COND_EXPR_STMT_LIST. */ class loop * -vect_loop_versioning (loop_vec_info loop_vinfo, - unsigned int th, bool check_profitability, - poly_uint64 versioning_threshold) +vect_loop_versioning (loop_vec_info loop_vinfo) { class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop; class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); @@ -2988,10 +2988,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo, bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo); bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo); bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo); + poly_uint64 versioning_threshold + = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); tree version_simd_if_cond = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo); + unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); - if (check_profitability) + if (th >= vect_vf_for_cost (loop_vinfo) + && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && !ordered_p (th, versioning_threshold)) cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters, build_int_cst (TREE_TYPE (scalar_loop_iters), th - 1)); diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..305ee2b06eabde9091049da829e6fc93161aa13f 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -1858,7 +1858,8 @@ vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo) for it. The different analyses will record information in the loop_vec_info struct. 
*/ static opt_result -vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts) +vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts, + bool *vect_epilogues_nomask) { opt_result ok = opt_result::success (); int res; @@ -2179,6 +2180,11 @@ start_over: } } + /* Disable epilogue vectorization if versioning is required because of the + iteration count. TODO: Needs investigation as to whether it is possible + to vectorize epilogues in this case. */ + *vect_epilogues_nomask &= !LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo); + /* During peeling, we need to check if number of loop iterations is enough for both peeled prolog loop and vector loop. This check can be merged along with threshold check of loop versioning, so @@ -2186,6 +2192,7 @@ start_over: if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) { poly_uint64 niters_th = 0; + unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); if (!vect_use_loop_mask_for_alignment_p (loop_vinfo)) { @@ -2206,6 +2213,14 @@ start_over: /* One additional iteration because of peeling for gap. */ if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) niters_th += 1; + + /* Use the same condition as vect_transform_loop to decide when to use + the cost to determine a versioning threshold. */ + if (th >= vect_vf_for_cost (loop_vinfo) + && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && ordered_p (th, niters_th)) + niters_th = ordered_max (poly_uint64 (th), niters_th); + LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th; } @@ -2329,7 +2344,7 @@ again: be vectorized. */ opt_loop_vec_info vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, - vec_info_shared *shared) + vec_info_shared *shared, bool *vect_epilogues_nomask) { auto_vector_sizes vector_sizes; @@ -2357,6 +2372,7 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, poly_uint64 autodetected_vector_size = 0; opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); poly_uint64 first_vector_size = 0; + unsigned vectorized_loops = 0; while (1) { /* Check the CFG characteristics of the loop (nesting, entry/exit). */ @@ -2376,14 +2392,17 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, if (orig_loop_vinfo) LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo; - opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts); + opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts, + vect_epilogues_nomask); if (res) { LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1; + vectorized_loops++; - if (loop->simdlen - && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo), - (unsigned HOST_WIDE_INT) loop->simdlen)) + if ((loop->simdlen + && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo), + (unsigned HOST_WIDE_INT) loop->simdlen)) + || *vect_epilogues_nomask) { if (first_loop_vinfo == NULL) { @@ -2392,7 +2411,13 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, loop->aux = NULL; } else - delete loop_vinfo; + { + /* Set versioning threshold of the original LOOP_VINFO based + on the last vectorization of the epilog. */ + LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) + = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); + delete loop_vinfo; + } } else { @@ -2401,7 +2426,12 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo, } } else - delete loop_vinfo; + { + /* Disable epilog vectorization if we can't determine the epilogs can + be vectorized. 
*/ + *vect_epilogues_nomask &= vectorized_loops > 1; + delete loop_vinfo; + } if (next_size == 0) autodetected_vector_size = current_vector_size; @@ -8468,7 +8498,7 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, Returns scalar epilogue loop if any. */ class loop * -vect_transform_loop (loop_vec_info loop_vinfo) +vect_transform_loop (loop_vec_info loop_vinfo, bool vect_epilogues_nomask) { class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); class loop *epilogue = NULL; @@ -8497,11 +8527,11 @@ vect_transform_loop (loop_vec_info loop_vinfo) if (th >= vect_vf_for_cost (loop_vinfo) && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) { - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, - "Profitability threshold is %d loop iterations.\n", - th); - check_profitability = true; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Profitability threshold is %d loop iterations.\n", + th); + check_profitability = true; } /* Make sure there exists a single-predecessor exit bb. Do this before @@ -8519,18 +8549,8 @@ vect_transform_loop (loop_vec_info loop_vinfo) if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) { - poly_uint64 versioning_threshold - = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); - if (check_profitability - && ordered_p (poly_uint64 (th), versioning_threshold)) - { - versioning_threshold = ordered_max (poly_uint64 (th), - versioning_threshold); - check_profitability = false; - } class loop *sloop - = vect_loop_versioning (loop_vinfo, th, check_profitability, - versioning_threshold); + = vect_loop_versioning (loop_vinfo); sloop->force_vectorize = false; check_profitability = false; } @@ -8557,7 +8577,8 @@ vect_transform_loop (loop_vec_info loop_vinfo) bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo); epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector, &step_vector, &niters_vector_mult_vf, th, - check_profitability, niters_no_overflow); + check_profitability, niters_no_overflow, + vect_epilogues_nomask); if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo) && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ()) scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo), @@ -8818,7 +8839,7 @@ vect_transform_loop (loop_vec_info loop_vinfo) if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) epilogue = NULL; - if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK)) + if (!vect_epilogues_nomask) epilogue = NULL; if (epilogue) diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 1456cde4c2c2dec7244c504d2c496248894a4f1e..e87170c592036a6f3f5330e1ebf5d125441861a6 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1480,10 +1480,10 @@ extern void vect_set_loop_condition (class loop *, loop_vec_info, extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge); class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *, class loop *, edge); -class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool, - poly_uint64); +class loop *vect_loop_versioning (loop_vec_info); extern class loop *vect_do_peeling (loop_vec_info, tree, tree, - tree *, tree *, tree *, int, bool, bool); + tree *, tree *, tree *, int, bool, bool, + bool); extern void vect_prepare_for_masked_peels (loop_vec_info); extern dump_user_location_t find_loop_location (class loop *); extern bool vect_can_advance_ivs_p (loop_vec_info); @@ -1610,7 +1610,8 @@ extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree, /* Drive for loop analysis stage. 
*/ extern opt_loop_vec_info vect_analyze_loop (class loop *, loop_vec_info, - vec_info_shared *); + vec_info_shared *, + bool *); extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL); extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *, tree *, bool); @@ -1622,7 +1623,7 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *, unsigned int, tree, unsigned int); /* Drive for loop transformation stage. */ -extern class loop *vect_transform_loop (loop_vec_info); +extern class loop *vect_transform_loop (loop_vec_info, bool); extern opt_loop_vec_info vect_analyze_loop_form (class loop *, vec_info_shared *); extern bool vectorizable_live_operation (stmt_vec_info, gimple_stmt_iterator *, diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c 100644 --- a/gcc/tree-vectorizer.c +++ b/gcc/tree-vectorizer.c @@ -61,6 +61,7 @@ along with GCC; see the file COPYING3. If not see #include "tree.h" #include "gimple.h" #include "predict.h" +#include "params.h" #include "tree-pass.h" #include "ssa.h" #include "cgraph.h" @@ -875,6 +876,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, vec_info_shared shared; auto_purge_vect_location sentinel; vect_location = find_loop_location (loop); + bool vect_epilogues_nomask = PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK); if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION && dump_enabled_p ()) dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS, @@ -884,7 +886,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p. */ opt_loop_vec_info loop_vinfo - = vect_analyze_loop (loop, orig_loop_vinfo, &shared); + = vect_analyze_loop (loop, orig_loop_vinfo, &shared, &vect_epilogues_nomask); loop->aux = loop_vinfo; if (!loop_vinfo) @@ -980,7 +982,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, "loop vectorized using variable length vectors\n"); } - loop_p new_loop = vect_transform_loop (loop_vinfo); + loop_p new_loop = vect_transform_loop (loop_vinfo, vect_epilogues_nomask); (*num_vectorized_loops)++; /* Now that the loop has been vectorized, allow it to be unrolled etc. */ @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, /* Epilogue of vectorized loop must be vectorized too. */ if (new_loop) - ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, - new_loop, loop_vinfo, NULL, NULL); + { + /* Don't include vectorized epilogues in the "vectorized loops" count. + */ + unsigned dont_count = *num_vectorized_loops; + ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count, + new_loop, loop_vinfo, NULL, NULL); + } return ret; }
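One detail of the try_vectorize_loop_1 hunk above that is easy to miss: the recursion into the epilogue passes a scratch counter ("dont_count") so that vectorized epilogues do not inflate the vectorized-loop statistic. A tiny standalone illustration of that pattern, with illustrative names rather than GCC code:

#include <cstdio>

struct loop { loop *epilogue; };

static void vectorize_one (loop *l, unsigned *num_vectorized_loops)
{
  ++*num_vectorized_loops;       // count this loop as vectorized

  if (l->epilogue)
    {
      unsigned dont_count = *num_vectorized_loops;   // scratch copy
      vectorize_one (l->epilogue, &dont_count);      // epilogue not counted
    }
}

int main ()
{
  loop epi = { nullptr };
  loop main_loop = { &epi };
  unsigned n = 0;
  vectorize_one (&main_loop, &n);
  std::printf ("vectorized loops counted: %u\n", n);   // prints 1
  return 0;
}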