Message ID | 698d5b64-5e1b-d0bd-f2ff-f5cd2763dbaa@arm.com |
---|---|
State | New |
Series | [aarch64] Implement determine_suggested_unroll_factor |
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: > Hi, > > This patch implements the costing function > determine_suggested_unroll_factor for aarch64. > It determines the unrolling factor by dividing the number of X > operations we can do per cycle by the number of X operations in the loop > body, taking this information from the vec_ops analysis during vector > costing and the available issue_info information. > We multiply the dividend by a potential reduction_latency, to improve > our pipeline utilization if we are stalled waiting on a particular > reduction operation. > > Right now we also have a work around for vectorization choices where the > main loop uses a NEON mode and predication is available, such that if > the main loop makes use of a NEON pattern that is not directly supported > by SVE we do not unroll, as that might cause performance regressions in > cases where we would enter the original main loop's VF. As an example if > you have a loop where you could use AVG_CEIL with a V8HI mode, you would > originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated > epilogue, using other instructions. Whereas with the unrolling you would > end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping > the original 8x NEON. In the future, we could handle this differently, > by either using a different costing model for epilogues, or potentially > vectorizing more than one single epilogue. > > gcc/ChangeLog: > > * config/aarch64/aarch64.cc (aarch64_vector_costs): Define > determine_suggested_unroll_factor. > (determine_suggested_unroll_factor): New function. > (aarch64_vector_costs::finish_costs): Use > determine_suggested_unroll_factor. > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -15680,6 +15680,7 @@ private: > unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, > unsigned int); > bool prefer_unrolled_loop () const; > + unsigned int determine_suggested_unroll_factor (); > > /* True if we have performed one-time initialization based on the > vec_info. */ > @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, > return sve_cycles_per_iter; > } > > +unsigned int > +aarch64_vector_costs::determine_suggested_unroll_factor () > +{ > + auto *issue_info = aarch64_tune_params.vec_costs->issue_info; > + if (!issue_info) > + return 1; > + bool sve = false; > + if (aarch64_sve_mode_p (m_vinfo->vector_mode)) Other code uses m_vec_flags & VEC_ANY_SVE for this. > + { > + if (!issue_info->sve) > + return 1; > + sve = true; > + } > + else > + { > + if (!issue_info->advsimd) > + return 1; The issue info should instead be taken from vec_ops.simd_issue_info () in the loop below. It can vary for each entry. > + /* If we are trying to unroll a NEON main loop that contains patterns s/a NEON/an Advanced SIMD/ > + that we do not support with SVE and we might use a predicated > + epilogue, we need to be conservative and block unrolling as this might > + lead to a less optimal loop for the first and only epilogue using the > + original loop's vectorization factor. > + TODO: Remove this constraint when we add support for multiple epilogue > + vectorization. 
*/ > + if (partial_vectors_supported_p () > + && param_vect_partial_vector_usage != 0 > + && !TARGET_SVE2) > + { > + unsigned int i; > + stmt_vec_info stmt_vinfo; > + FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo) > + { > + if (is_pattern_stmt_p (stmt_vinfo)) > + { > + gimple *stmt = stmt_vinfo->stmt; > + if (is_gimple_call (stmt) > + && gimple_call_internal_p (stmt)) > + { > + enum internal_fn ifn > + = gimple_call_internal_fn (stmt); > + switch (ifn) > + { > + case IFN_AVG_FLOOR: > + case IFN_AVG_CEIL: > + return 1; > + default: > + break; > + } > + } > + } > + } > + } I think we should instead record this during add_stmt_cost, like we do for the other data that this function uses. We need to drop the partial_vectors_supported_p () condition though: enabling SVE must not make Advanced SIMD code worse. So for now, Advanced SIMD-only code will have to suffer the missed unrolling opportunity too. > + } > + > + unsigned int max_unroll_factor = 1; > + aarch64_simd_vec_issue_info const *vec_issue > + = sve ? issue_info->sve : issue_info->advsimd; > + for (auto vec_ops : m_ops) > + { > + /* Limit unroll factor to 4 for now. */ > + unsigned int unroll_factor = 4; Would be good to make this an aarch64-specific --param in case people want to play with different values. > + unsigned int factor > + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; > + unsigned int temp; > + > + /* Sanity check, this should never happen. */ > + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) > + return 1; > + > + /* Check stores. */ > + if (vec_ops.stores > 0) > + { > + temp = CEIL (factor * vec_issue->stores_per_cycle, > + vec_ops.stores); > + unroll_factor = MIN (unroll_factor, temp); > + } > + > + /* Check loads. */ > + if (vec_ops.loads > 0) > + { > + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, > + vec_ops.loads); This should use loads + stores rather than just loads. > + unroll_factor = MIN (unroll_factor, temp); > + } > + > + /* Check general ops. */ > + if (vec_ops.general_ops > 0) > + { > + temp = CEIL (factor * vec_issue->general_ops_per_cycle, > + vec_ops.general_ops); > + unroll_factor = MIN (unroll_factor, temp); > + } In principle we should also take pred_ops into account when sve_issue_info () is nonnull, while ignoring the pred_ops associated with loop predication (added by analyze_loop_vinfo). I guess we should maintain two pred_op counts. That's just a suggestion for a possible future improvement though. It doesn't need to be done as part of this patch. Thanks, Richard > + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); > + } > + > + /* Make sure unroll factor is power of 2. */ > + return 1 << ceil_log2 (max_unroll_factor); > +} > + > /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary > and return the new cost. */ > unsigned int > @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) > if (loop_vinfo > && m_vec_flags > && aarch64_use_new_vector_costs_p ()) > - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > - m_costs[vect_body]); > + { > + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > + m_costs[vect_body]); > + m_suggested_unroll_factor = determine_suggested_unroll_factor (); > + } > > /* Apply the heuristic described above m_stp_sequence_cost. Prefer > the scalar code in the event of a tie, since there is more chance
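To make the arithmetic of the heuristic above concrete, here is a minimal self-contained sketch of the calculation for a single tuning entry. It is not GCC code: the structures, issue rates and per-iteration operation counts are invented for illustration, and the real implementation additionally takes the maximum over all entries in m_ops and reads its numbers from the tuning's issue information.

#include <algorithm>
#include <cstdio>

struct issue_rates   /* per-cycle throughput of one hypothetical core */
{
  unsigned stores_per_cycle;
  unsigned loads_stores_per_cycle;
  unsigned general_ops_per_cycle;
};

struct loop_body     /* per-iteration operation counts of one hypothetical loop */
{
  unsigned stores, loads, general_ops, reduction_latency;
};

static unsigned ceil_div (unsigned a, unsigned b) { return (a + b - 1) / b; }

/* Round up to a power of two, like 1 << ceil_log2 (x) in the patch.  */
static unsigned round_up_pow2 (unsigned x)
{
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

static unsigned suggested_unroll (const issue_rates &r, const loop_body &b,
                                  unsigned limit /* the unroll cap */)
{
  unsigned factor = std::max (b.reduction_latency, 1u);
  unsigned uf = limit;
  if (b.stores)
    uf = std::min (uf, ceil_div (factor * r.stores_per_cycle, b.stores));
  if (b.loads)
    uf = std::min (uf, ceil_div (factor * r.loads_stores_per_cycle,
                                 b.loads + b.stores));
  if (b.general_ops)
    uf = std::min (uf, ceil_div (factor * r.general_ops_per_cycle,
                                 b.general_ops));
  return round_up_pow2 (uf);
}

int main ()
{
  issue_rates r = { 2, 3, 4 };   /* invented throughput numbers */
  loop_body b = { 1, 2, 3, 2 };  /* invented per-iteration counts */
  std::printf ("suggested unroll factor: %u\n", suggested_unroll (r, b, 4));
}

With these invented numbers the load/store check is the binding one: CEIL (2 * 3, 2 + 1) = 2, so the sketch suggests unrolling the loop twice.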
Hi, Addressed all of your comments bar the pred ops one. Is this OK? gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_vector_costs): Define determine_suggested_unroll_factor and m_nosve_pattern. (determine_suggested_unroll_factor): New function. (aarch64_vector_costs::add_stmt_cost): Check for a qualifying pattern to set m_nosve_pattern. (aarch64_vector_costs::finish_costs): Use determine_suggested_unroll_factor. * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New. On 16/03/2022 18:01, Richard Sandiford wrote: > "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: >> Hi, >> >> This patch implements the costing function >> determine_suggested_unroll_factor for aarch64. >> It determines the unrolling factor by dividing the number of X >> operations we can do per cycle by the number of X operations in the loop >> body, taking this information from the vec_ops analysis during vector >> costing and the available issue_info information. >> We multiply the dividend by a potential reduction_latency, to improve >> our pipeline utilization if we are stalled waiting on a particular >> reduction operation. >> >> Right now we also have a work around for vectorization choices where the >> main loop uses a NEON mode and predication is available, such that if >> the main loop makes use of a NEON pattern that is not directly supported >> by SVE we do not unroll, as that might cause performance regressions in >> cases where we would enter the original main loop's VF. As an example if >> you have a loop where you could use AVG_CEIL with a V8HI mode, you would >> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated >> epilogue, using other instructions. Whereas with the unrolling you would >> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping >> the original 8x NEON. In the future, we could handle this differently, >> by either using a different costing model for epilogues, or potentially >> vectorizing more than one single epilogue. >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64.cc (aarch64_vector_costs): Define >> determine_suggested_unroll_factor. >> (determine_suggested_unroll_factor): New function. >> (aarch64_vector_costs::finish_costs): Use >> determine_suggested_unroll_factor. >> >> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644 >> --- a/gcc/config/aarch64/aarch64.cc >> +++ b/gcc/config/aarch64/aarch64.cc >> @@ -15680,6 +15680,7 @@ private: >> unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, >> unsigned int); >> bool prefer_unrolled_loop () const; >> + unsigned int determine_suggested_unroll_factor (); >> >> /* True if we have performed one-time initialization based on the >> vec_info. */ >> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, >> return sve_cycles_per_iter; >> } >> >> +unsigned int >> +aarch64_vector_costs::determine_suggested_unroll_factor () >> +{ >> + auto *issue_info = aarch64_tune_params.vec_costs->issue_info; >> + if (!issue_info) >> + return 1; >> + bool sve = false; >> + if (aarch64_sve_mode_p (m_vinfo->vector_mode)) > Other code uses m_vec_flags & VEC_ANY_SVE for this. > >> + { >> + if (!issue_info->sve) >> + return 1; >> + sve = true; >> + } >> + else >> + { >> + if (!issue_info->advsimd) >> + return 1; > The issue info should instead be taken from vec_ops.simd_issue_info () > in the loop below. It can vary for each entry. 
> >> + /* If we are trying to unroll a NEON main loop that contains patterns > s/a NEON/an Advanced SIMD/ > >> + that we do not support with SVE and we might use a predicated >> + epilogue, we need to be conservative and block unrolling as this might >> + lead to a less optimal loop for the first and only epilogue using the >> + original loop's vectorization factor. >> + TODO: Remove this constraint when we add support for multiple epilogue >> + vectorization. */ >> + if (partial_vectors_supported_p () >> + && param_vect_partial_vector_usage != 0 >> + && !TARGET_SVE2) >> + { >> + unsigned int i; >> + stmt_vec_info stmt_vinfo; >> + FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo) >> + { >> + if (is_pattern_stmt_p (stmt_vinfo)) >> + { >> + gimple *stmt = stmt_vinfo->stmt; >> + if (is_gimple_call (stmt) >> + && gimple_call_internal_p (stmt)) >> + { >> + enum internal_fn ifn >> + = gimple_call_internal_fn (stmt); >> + switch (ifn) >> + { >> + case IFN_AVG_FLOOR: >> + case IFN_AVG_CEIL: >> + return 1; >> + default: >> + break; >> + } >> + } >> + } >> + } >> + } > I think we should instead record this during add_stmt_cost, like we do > for the other data that this function uses. > > We need to drop the partial_vectors_supported_p () condition though: > enabling SVE must not make Advanced SIMD code worse. So for now, > Advanced SIMD-only code will have to suffer the missed unrolling > opportunity too. > >> + } >> + >> + unsigned int max_unroll_factor = 1; >> + aarch64_simd_vec_issue_info const *vec_issue >> + = sve ? issue_info->sve : issue_info->advsimd; >> + for (auto vec_ops : m_ops) >> + { >> + /* Limit unroll factor to 4 for now. */ >> + unsigned int unroll_factor = 4; > Would be good to make this an aarch64-specific --param in case people > want to play with different values. > >> + unsigned int factor >> + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; >> + unsigned int temp; >> + >> + /* Sanity check, this should never happen. */ >> + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) >> + return 1; >> + >> + /* Check stores. */ >> + if (vec_ops.stores > 0) >> + { >> + temp = CEIL (factor * vec_issue->stores_per_cycle, >> + vec_ops.stores); >> + unroll_factor = MIN (unroll_factor, temp); >> + } >> + >> + /* Check loads. */ >> + if (vec_ops.loads > 0) >> + { >> + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, >> + vec_ops.loads); > This should use loads + stores rather than just loads. > >> + unroll_factor = MIN (unroll_factor, temp); >> + } >> + >> + /* Check general ops. */ >> + if (vec_ops.general_ops > 0) >> + { >> + temp = CEIL (factor * vec_issue->general_ops_per_cycle, >> + vec_ops.general_ops); >> + unroll_factor = MIN (unroll_factor, temp); >> + } > In principle we should also take pred_ops into account when > sve_issue_info () is nonnull, while ignoring the pred_ops associated > with loop predication (added by analyze_loop_vinfo). I guess we > should maintain two pred_op counts. > > That's just a suggestion for a possible future improvement though. > It doesn't need to be done as part of this patch. > > Thanks, > Richard > >> + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); >> + } >> + >> + /* Make sure unroll factor is power of 2. */ >> + return 1 << ceil_log2 (max_unroll_factor); >> +} >> + >> /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary >> and return the new cost. 
*/ >> unsigned int >> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) >> if (loop_vinfo >> && m_vec_flags >> && aarch64_use_new_vector_costs_p ()) >> - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >> - m_costs[vect_body]); >> + { >> + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >> + m_costs[vect_body]); >> + m_suggested_unroll_factor = determine_suggested_unroll_factor (); >> + } >> >> /* Apply the heuristic described above m_stp_sequence_cost. Prefer >> the scalar code in the event of a tie, since there is more chance diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -15637,6 +15637,7 @@ private: unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, unsigned int); bool prefer_unrolled_loop () const; + unsigned int determine_suggested_unroll_factor (); /* True if we have performed one-time initialization based on the vec_info. */ @@ -15699,6 +15700,10 @@ private: or vector loop. There is one entry for each tuning option of interest. */ auto_vec<aarch64_vec_op_count, 2> m_ops; + + /* This loop uses a pattern that is not supported by SVE, but is supported by + Advanced SIMD and SVE2. */ + bool m_nosve_pattern = false; }; aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo, @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, as one iteration of the SVE loop. */ if (where == vect_body && m_unrolled_advsimd_niters) m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters; + + if (is_pattern_stmt_p (stmt_info)) + { + gimple *stmt = stmt_info->stmt; + if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt)) + { + enum internal_fn ifn + = gimple_call_internal_fn (stmt); + switch (ifn) + { + case IFN_AVG_FLOOR: + case IFN_AVG_CEIL: + m_nosve_pattern = true; + default: + break; + } + } + } } return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ()); } @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, return sve_cycles_per_iter; } +unsigned int +aarch64_vector_costs::determine_suggested_unroll_factor () +{ + bool sve = m_vec_flags & VEC_ANY_SVE; + /* If we are trying to unroll an Advanced SIMD main loop that contains + patterns that we do not support with SVE and we might use a predicated + epilogue, we need to be conservative and block unrolling as this might + lead to a less optimal loop for the first and only epilogue using the + original loop's vectorization factor. + TODO: Remove this constraint when we add support for multiple epilogue + vectorization. */ + if (!sve && TARGET_SVE2 && m_nosve_pattern) + return 1; + + unsigned int max_unroll_factor = 1; + for (auto vec_ops : m_ops) + { + aarch64_simd_vec_issue_info const *vec_issue + = vec_ops.simd_issue_info (); + if (!vec_issue) + return 1; + /* Limit unroll factor to a value adjustable by the user, the default + value is 4. */ + unsigned int unroll_factor = aarch64_vect_unroll_limit; + unsigned int factor + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; + unsigned int temp; + + /* Sanity check, this should never happen. */ + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) + return 1; + + /* Check stores. 
*/ + if (vec_ops.stores > 0) + { + temp = CEIL (factor * vec_issue->stores_per_cycle, + vec_ops.stores); + unroll_factor = MIN (unroll_factor, temp); + } + + /* Check loads + stores. */ + if (vec_ops.loads > 0) + { + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, + vec_ops.loads + vec_ops.stores); + unroll_factor = MIN (unroll_factor, temp); + } + + /* Check general ops. */ + if (vec_ops.general_ops > 0) + { + temp = CEIL (factor * vec_issue->general_ops_per_cycle, + vec_ops.general_ops); + unroll_factor = MIN (unroll_factor, temp); + } + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); + } + + /* Make sure unroll factor is power of 2. */ + return 1 << ceil_log2 (max_unroll_factor); +} + /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary and return the new cost. */ unsigned int @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) if (loop_vinfo && m_vec_flags && aarch64_use_new_vector_costs_p ()) - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, - m_costs[vect_body]); + { + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, + m_costs[vect_body]); + m_suggested_unroll_factor = determine_suggested_unroll_factor (); + } /* Apply the heuristic described above m_stp_sequence_cost. Prefer the scalar code in the event of a tie, since there is more chance diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644 --- a/gcc/config/aarch64/aarch64.opt +++ b/gcc/config/aarch64/aarch64.opt @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence. -param=aarch64-mops-memset-size-threshold= Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param Constant memset size in bytes from which to start using MOPS sequence. + +-param=aarch64-vect-unroll-limit= +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param +Limit how much the autovectorizer may unroll a loop.
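For reference, the IFN_AVG_CEIL pattern discussed in the cover letter typically comes from a rounding byte-average loop of roughly the shape below. This is an illustrative example only (not code from the patch or from any particular benchmark): the arithmetic happens in int because of integer promotion, and the narrowing store back to unsigned char is what lets the vectorizer's average-recognition pattern treat it as an AVG_CEIL (rounding halving add).

void
rounding_average (unsigned char *dst, const unsigned char *a,
                  const unsigned char *b, int n)
{
  for (int i = 0; i < n; i++)
    /* Promotion to int means a[i] + b[i] + 1 cannot overflow; truncating
       the result back to unsigned char matches the averaging pattern.  */
    dst[i] = (a[i] + b[i] + 1) >> 1;
}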
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: > Hi, > > Addressed all of your comments bar the pred ops one. > > Is this OK? > > > gcc/ChangeLog: > > * config/aarch64/aarch64.cc (aarch64_vector_costs): Define > determine_suggested_unroll_factor and m_nosve_pattern. > (determine_suggested_unroll_factor): New function. > (aarch64_vector_costs::add_stmt_cost): Check for a qualifying > pattern > to set m_nosve_pattern. > (aarch64_vector_costs::finish_costs): Use > determine_suggested_unroll_factor. > * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New. > > On 16/03/2022 18:01, Richard Sandiford wrote: >> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: >>> Hi, >>> >>> This patch implements the costing function >>> determine_suggested_unroll_factor for aarch64. >>> It determines the unrolling factor by dividing the number of X >>> operations we can do per cycle by the number of X operations in the loop >>> body, taking this information from the vec_ops analysis during vector >>> costing and the available issue_info information. >>> We multiply the dividend by a potential reduction_latency, to improve >>> our pipeline utilization if we are stalled waiting on a particular >>> reduction operation. >>> >>> Right now we also have a work around for vectorization choices where the >>> main loop uses a NEON mode and predication is available, such that if >>> the main loop makes use of a NEON pattern that is not directly supported >>> by SVE we do not unroll, as that might cause performance regressions in >>> cases where we would enter the original main loop's VF. As an example if >>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would >>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated >>> epilogue, using other instructions. Whereas with the unrolling you would >>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping >>> the original 8x NEON. In the future, we could handle this differently, >>> by either using a different costing model for epilogues, or potentially >>> vectorizing more than one single epilogue. >>> >>> gcc/ChangeLog: >>> >>> * config/aarch64/aarch64.cc (aarch64_vector_costs): Define >>> determine_suggested_unroll_factor. >>> (determine_suggested_unroll_factor): New function. >>> (aarch64_vector_costs::finish_costs): Use >>> determine_suggested_unroll_factor. >>> >>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644 >>> --- a/gcc/config/aarch64/aarch64.cc >>> +++ b/gcc/config/aarch64/aarch64.cc >>> @@ -15680,6 +15680,7 @@ private: >>> unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, >>> unsigned int); >>> bool prefer_unrolled_loop () const; >>> + unsigned int determine_suggested_unroll_factor (); >>> >>> /* True if we have performed one-time initialization based on the >>> vec_info. */ >>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, >>> return sve_cycles_per_iter; >>> } >>> >>> +unsigned int >>> +aarch64_vector_costs::determine_suggested_unroll_factor () >>> +{ >>> + auto *issue_info = aarch64_tune_params.vec_costs->issue_info; >>> + if (!issue_info) >>> + return 1; >>> + bool sve = false; >>> + if (aarch64_sve_mode_p (m_vinfo->vector_mode)) >> Other code uses m_vec_flags & VEC_ANY_SVE for this. 
>> >>> + { >>> + if (!issue_info->sve) >>> + return 1; >>> + sve = true; >>> + } >>> + else >>> + { >>> + if (!issue_info->advsimd) >>> + return 1; >> The issue info should instead be taken from vec_ops.simd_issue_info () >> in the loop below. It can vary for each entry. >> >>> + /* If we are trying to unroll a NEON main loop that contains patterns >> s/a NEON/an Advanced SIMD/ >> >>> + that we do not support with SVE and we might use a predicated >>> + epilogue, we need to be conservative and block unrolling as this might >>> + lead to a less optimal loop for the first and only epilogue using the >>> + original loop's vectorization factor. >>> + TODO: Remove this constraint when we add support for multiple epilogue >>> + vectorization. */ >>> + if (partial_vectors_supported_p () >>> + && param_vect_partial_vector_usage != 0 >>> + && !TARGET_SVE2) >>> + { >>> + unsigned int i; >>> + stmt_vec_info stmt_vinfo; >>> + FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo) >>> + { >>> + if (is_pattern_stmt_p (stmt_vinfo)) >>> + { >>> + gimple *stmt = stmt_vinfo->stmt; >>> + if (is_gimple_call (stmt) >>> + && gimple_call_internal_p (stmt)) >>> + { >>> + enum internal_fn ifn >>> + = gimple_call_internal_fn (stmt); >>> + switch (ifn) >>> + { >>> + case IFN_AVG_FLOOR: >>> + case IFN_AVG_CEIL: >>> + return 1; >>> + default: >>> + break; >>> + } >>> + } >>> + } >>> + } >>> + } >> I think we should instead record this during add_stmt_cost, like we do >> for the other data that this function uses. >> >> We need to drop the partial_vectors_supported_p () condition though: >> enabling SVE must not make Advanced SIMD code worse. So for now, >> Advanced SIMD-only code will have to suffer the missed unrolling >> opportunity too. >> >>> + } >>> + >>> + unsigned int max_unroll_factor = 1; >>> + aarch64_simd_vec_issue_info const *vec_issue >>> + = sve ? issue_info->sve : issue_info->advsimd; >>> + for (auto vec_ops : m_ops) >>> + { >>> + /* Limit unroll factor to 4 for now. */ >>> + unsigned int unroll_factor = 4; >> Would be good to make this an aarch64-specific --param in case people >> want to play with different values. >> >>> + unsigned int factor >>> + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; >>> + unsigned int temp; >>> + >>> + /* Sanity check, this should never happen. */ >>> + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) >>> + return 1; >>> + >>> + /* Check stores. */ >>> + if (vec_ops.stores > 0) >>> + { >>> + temp = CEIL (factor * vec_issue->stores_per_cycle, >>> + vec_ops.stores); >>> + unroll_factor = MIN (unroll_factor, temp); >>> + } >>> + >>> + /* Check loads. */ >>> + if (vec_ops.loads > 0) >>> + { >>> + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, >>> + vec_ops.loads); >> This should use loads + stores rather than just loads. >> >>> + unroll_factor = MIN (unroll_factor, temp); >>> + } >>> + >>> + /* Check general ops. */ >>> + if (vec_ops.general_ops > 0) >>> + { >>> + temp = CEIL (factor * vec_issue->general_ops_per_cycle, >>> + vec_ops.general_ops); >>> + unroll_factor = MIN (unroll_factor, temp); >>> + } >> In principle we should also take pred_ops into account when >> sve_issue_info () is nonnull, while ignoring the pred_ops associated >> with loop predication (added by analyze_loop_vinfo). I guess we >> should maintain two pred_op counts. >> >> That's just a suggestion for a possible future improvement though. >> It doesn't need to be done as part of this patch. 
>> >> Thanks, >> Richard >> >>> + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); >>> + } >>> + >>> + /* Make sure unroll factor is power of 2. */ >>> + return 1 << ceil_log2 (max_unroll_factor); >>> +} >>> + >>> /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary >>> and return the new cost. */ >>> unsigned int >>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) >>> if (loop_vinfo >>> && m_vec_flags >>> && aarch64_use_new_vector_costs_p ()) >>> - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>> - m_costs[vect_body]); >>> + { >>> + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>> + m_costs[vect_body]); >>> + m_suggested_unroll_factor = determine_suggested_unroll_factor (); >>> + } >>> >>> /* Apply the heuristic described above m_stp_sequence_cost. Prefer >>> the scalar code in the event of a tie, since there is more chance > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -15637,6 +15637,7 @@ private: > unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, > unsigned int); > bool prefer_unrolled_loop () const; > + unsigned int determine_suggested_unroll_factor (); > > /* True if we have performed one-time initialization based on the > vec_info. */ > @@ -15699,6 +15700,10 @@ private: > or vector loop. There is one entry for each tuning option of > interest. */ > auto_vec<aarch64_vec_op_count, 2> m_ops; > + > + /* This loop uses a pattern that is not supported by SVE, but is supported by > + Advanced SIMD and SVE2. */ > + bool m_nosve_pattern = false; This is true for other things besides AVG_FLOOR/CEIL. I think we should just accept that this is an x264 hack ;-) and name it “m_has_avg” instead. Could you put it with the other bool field? That would lead to a nicer layout, although it hardly matters much here. > }; > > aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo, > @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > as one iteration of the SVE loop. */ > if (where == vect_body && m_unrolled_advsimd_niters) > m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters; > + > + if (is_pattern_stmt_p (stmt_info)) > + { Not sure this is necessary. We'd want the same thing if (for whatever reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future. > + gimple *stmt = stmt_info->stmt; > + if (is_gimple_call (stmt) > + && gimple_call_internal_p (stmt)) > + { > + enum internal_fn ifn > + = gimple_call_internal_fn (stmt); > + switch (ifn) Very minor nit, but might as well switch on gimple_call_internal_fn (stmt) directly, to avoid the awkward line break. 
> + { > + case IFN_AVG_FLOOR: > + case IFN_AVG_CEIL: > + m_nosve_pattern = true; > + default: > + break; > + } > + } > + } > } > return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ()); > } > @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, > return sve_cycles_per_iter; > } > > +unsigned int > +aarch64_vector_costs::determine_suggested_unroll_factor () > +{ > + bool sve = m_vec_flags & VEC_ANY_SVE; > + /* If we are trying to unroll an Advanced SIMD main loop that contains > + patterns that we do not support with SVE and we might use a predicated > + epilogue, we need to be conservative and block unrolling as this might > + lead to a less optimal loop for the first and only epilogue using the > + original loop's vectorization factor. > + TODO: Remove this constraint when we add support for multiple epilogue > + vectorization. */ > + if (!sve && TARGET_SVE2 && m_nosve_pattern) > + return 1; Shouldn't this be !TARGET_SVE2? > + > + unsigned int max_unroll_factor = 1; > + for (auto vec_ops : m_ops) > + { > + aarch64_simd_vec_issue_info const *vec_issue > + = vec_ops.simd_issue_info (); > + if (!vec_issue) > + return 1; > + /* Limit unroll factor to a value adjustable by the user, the default > + value is 4. */ > + unsigned int unroll_factor = aarch64_vect_unroll_limit; > + unsigned int factor > + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; > + unsigned int temp; > + > + /* Sanity check, this should never happen. */ > + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) > + return 1; > + > + /* Check stores. */ > + if (vec_ops.stores > 0) > + { > + temp = CEIL (factor * vec_issue->stores_per_cycle, > + vec_ops.stores); > + unroll_factor = MIN (unroll_factor, temp); > + } > + > + /* Check loads + stores. */ > + if (vec_ops.loads > 0) > + { > + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, > + vec_ops.loads + vec_ops.stores); > + unroll_factor = MIN (unroll_factor, temp); > + } > + > + /* Check general ops. */ > + if (vec_ops.general_ops > 0) > + { > + temp = CEIL (factor * vec_issue->general_ops_per_cycle, > + vec_ops.general_ops); > + unroll_factor = MIN (unroll_factor, temp); > + } > + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); > + } > + > + /* Make sure unroll factor is power of 2. */ > + return 1 << ceil_log2 (max_unroll_factor); > +} > + > /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary > and return the new cost. */ > unsigned int > @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) > if (loop_vinfo > && m_vec_flags > && aarch64_use_new_vector_costs_p ()) > - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > - m_costs[vect_body]); > + { > + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > + m_costs[vect_body]); > + m_suggested_unroll_factor = determine_suggested_unroll_factor (); > + } > > /* Apply the heuristic described above m_stp_sequence_cost. Prefer > the scalar code in the event of a tie, since there is more chance > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt > index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644 > --- a/gcc/config/aarch64/aarch64.opt > +++ b/gcc/config/aarch64/aarch64.opt > @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence. 
> -param=aarch64-mops-memset-size-threshold= > Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param > Constant memset size in bytes from which to start using MOPS sequence. > + > +-param=aarch64-vect-unroll-limit= > +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param > +Limit how much the autovectorizer may unroll a loop. This needs to be documented in invoke.texi, alongside the existing AArch64-specific params. OK with those changes, thanks. Richard
Committed attached patch.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 5da3d14dc357ba01351bca961af4f100a89665e1..d66626748e5edc46e46edf10e243114a9f74be97 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -15637,11 +15637,16 @@ private: unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, unsigned int); bool prefer_unrolled_loop () const; + unsigned int determine_suggested_unroll_factor (); /* True if we have performed one-time initialization based on the vec_info. */ bool m_analyzed_vinfo = false; + /* This loop uses an average operation that is not supported by SVE, but is + supported by Advanced SIMD and SVE2. */ + bool m_has_avg = false; + /* - If M_VEC_FLAGS is zero then we're costing the original scalar code. - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced SIMD code. @@ -16642,6 +16647,21 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, as one iteration of the SVE loop. */ if (where == vect_body && m_unrolled_advsimd_niters) m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters; + + /* Detect the use of an averaging operation. */ + gimple *stmt = stmt_info->stmt; + if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt)) + { + switch (gimple_call_internal_fn (stmt)) + { + case IFN_AVG_FLOOR: + case IFN_AVG_CEIL: + m_has_avg = true; + default: + break; + } + } } return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ()); } @@ -16725,6 +16745,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, return sve_cycles_per_iter; } +unsigned int +aarch64_vector_costs::determine_suggested_unroll_factor () +{ + bool sve = m_vec_flags & VEC_ANY_SVE; + /* If we are trying to unroll an Advanced SIMD main loop that contains + an averaging operation that we do not support with SVE and we might use a + predicated epilogue, we need to be conservative and block unrolling as + this might lead to a less optimal loop for the first and only epilogue + using the original loop's vectorization factor. + TODO: Remove this constraint when we add support for multiple epilogue + vectorization. */ + if (!sve && !TARGET_SVE2 && m_has_avg) + return 1; + + unsigned int max_unroll_factor = 1; + for (auto vec_ops : m_ops) + { + aarch64_simd_vec_issue_info const *vec_issue + = vec_ops.simd_issue_info (); + if (!vec_issue) + return 1; + /* Limit unroll factor to a value adjustable by the user, the default + value is 4.
*/ + unsigned int unroll_factor = aarch64_vect_unroll_limit; + unsigned int factor + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; + unsigned int temp; + + /* Sanity check, this should never happen. */ + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) + return 1; + + /* Check stores. */ + if (vec_ops.stores > 0) + { + temp = CEIL (factor * vec_issue->stores_per_cycle, + vec_ops.stores); + unroll_factor = MIN (unroll_factor, temp); + } + + /* Check loads + stores. */ + if (vec_ops.loads > 0) + { + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, + vec_ops.loads + vec_ops.stores); + unroll_factor = MIN (unroll_factor, temp); + } + + /* Check general ops. */ + if (vec_ops.general_ops > 0) + { + temp = CEIL (factor * vec_issue->general_ops_per_cycle, + vec_ops.general_ops); + unroll_factor = MIN (unroll_factor, temp); + } + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); + } + + /* Make sure unroll factor is power of 2. */ + return 1 << ceil_log2 (max_unroll_factor); +} + /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary and return the new cost. */ unsigned int @@ -16861,8 +16943,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) if (loop_vinfo && m_vec_flags && aarch64_use_new_vector_costs_p ()) - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, - m_costs[vect_body]); + { + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, + m_costs[vect_body]); + m_suggested_unroll_factor = determine_suggested_unroll_factor (); + } /* Apply the heuristic described above m_stp_sequence_cost. Prefer the scalar code in the event of a tie, since there is more chance diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644 --- a/gcc/config/aarch64/aarch64.opt +++ b/gcc/config/aarch64/aarch64.opt @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence. -param=aarch64-mops-memset-size-threshold= Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param Constant memset size in bytes from which to start using MOPS sequence. + +-param=aarch64-vect-unroll-limit= +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param +Limit how much the autovectorizer may unroll a loop. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index fb3dec4ab0c37810d4409fa6da6a82f0154b0a3a..8c27d55d1edf8bc0f4df6fed7e1df3523bc9686c 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -15174,6 +15174,12 @@ If this parameter is set to @var{n}, GCC will not use this heuristic for loops that are known to execute in fewer than @var{n} Advanced SIMD iterations. +@item aarch64-vect-unroll-limit +The vectorizer will use available tuning information to determine whether it +would be beneficial to unroll the main vectorized loop and by how much. This +parameter set's the upper bound of how much the vectorizer will unroll the main +loop. The default value is four. + @end table @end table
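As a usage note, the new limit can be experimented with directly from the command line once a tuning that provides issue information is selected; for example (the specific -mcpu value here is only an assumption about such a tuning):

gcc -O3 -mcpu=neoverse-v1 --param=aarch64-vect-unroll-limit=2 loop.c

Setting the parameter to 1 effectively disables the unrolling suggestion, since the computed factor is capped at the limit before being rounded up to a power of two.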
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -15680,6 +15680,7 @@ private: unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *, unsigned int); bool prefer_unrolled_loop () const; + unsigned int determine_suggested_unroll_factor (); /* True if we have performed one-time initialization based on the vec_info. */ @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops, return sve_cycles_per_iter; } +unsigned int +aarch64_vector_costs::determine_suggested_unroll_factor () +{ + auto *issue_info = aarch64_tune_params.vec_costs->issue_info; + if (!issue_info) + return 1; + bool sve = false; + if (aarch64_sve_mode_p (m_vinfo->vector_mode)) + { + if (!issue_info->sve) + return 1; + sve = true; + } + else + { + if (!issue_info->advsimd) + return 1; + /* If we are trying to unroll a NEON main loop that contains patterns + that we do not support with SVE and we might use a predicated + epilogue, we need to be conservative and block unrolling as this might + lead to a less optimal loop for the first and only epilogue using the + original loop's vectorization factor. + TODO: Remove this constraint when we add support for multiple epilogue + vectorization. */ + if (partial_vectors_supported_p () + && param_vect_partial_vector_usage != 0 + && !TARGET_SVE2) + { + unsigned int i; + stmt_vec_info stmt_vinfo; + FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo) + { + if (is_pattern_stmt_p (stmt_vinfo)) + { + gimple *stmt = stmt_vinfo->stmt; + if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt)) + { + enum internal_fn ifn + = gimple_call_internal_fn (stmt); + switch (ifn) + { + case IFN_AVG_FLOOR: + case IFN_AVG_CEIL: + return 1; + default: + break; + } + } + } + } + } + } + + unsigned int max_unroll_factor = 1; + aarch64_simd_vec_issue_info const *vec_issue + = sve ? issue_info->sve : issue_info->advsimd; + for (auto vec_ops : m_ops) + { + /* Limit unroll factor to 4 for now. */ + unsigned int unroll_factor = 4; + unsigned int factor + = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1; + unsigned int temp; + + /* Sanity check, this should never happen. */ + if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0) + return 1; + + /* Check stores. */ + if (vec_ops.stores > 0) + { + temp = CEIL (factor * vec_issue->stores_per_cycle, + vec_ops.stores); + unroll_factor = MIN (unroll_factor, temp); + } + + /* Check loads. */ + if (vec_ops.loads > 0) + { + temp = CEIL (factor * vec_issue->loads_stores_per_cycle, + vec_ops.loads); + unroll_factor = MIN (unroll_factor, temp); + } + + /* Check general ops. */ + if (vec_ops.general_ops > 0) + { + temp = CEIL (factor * vec_issue->general_ops_per_cycle, + vec_ops.general_ops); + unroll_factor = MIN (unroll_factor, temp); + } + max_unroll_factor = MAX (max_unroll_factor, unroll_factor); + } + + /* Make sure unroll factor is power of 2. */ + return 1 << ceil_log2 (max_unroll_factor); +} + /* BODY_COST is the cost of a vector loop body. Adjust the cost as necessary and return the new cost. 
*/ unsigned int @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) if (loop_vinfo && m_vec_flags && aarch64_use_new_vector_costs_p ()) - m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, - m_costs[vect_body]); + { + m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, + m_costs[vect_body]); + m_suggested_unroll_factor = determine_suggested_unroll_factor (); + } /* Apply the heuristic described above m_stp_sequence_cost. Prefer the scalar code in the event of a tie, since there is more chance