Message ID | 20241128102204.2282699-1-christophe.lyon@linaro.org |
---|---|
State | New |
Headers | show |
Series | arm: [MVE intrinsics] fix vctpq intrinsic implementation [PR target/117814] | expand |
Hi Christophe, On 28/11/2024 10:22, Christophe Lyon wrote: > The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based > on the input value, but also constrained by a VPT block (if present), > or if used within a DLSTP/LETP loop. > > Therefore we need to inform the compiler that this intrinsic reads the > FPCXT register, otherwise it could make incorrect assumptions: for > instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it > would hoist p1 = vctp32q (g) outside of the loop. We chatted about this offlist but it's good to share here for others too. I do not believe the transformation gcc is doing here is wrong. The transformation we do for test 7, along with some others in the testsuite, relies on analysis we do to check whether masks, that are not the loop predicate mask, used within the loop have a side effect. In other words, any instruction that is not predicated by the loop predicate, be that unpredicated or predicated by another mask, triggers an analysis to check whether the results are used in a safe way. Check the comments above 'arm_mve_impl_predicated_p' in arm.cc For test7 the non-loop predicate 'p1' is used to predicate a load inside the loop, when dlstp'ed that load will be masked by 'p & p1' instead, which means it could be loading less than initially intended, however, the results of that load are used in a vadd predicated by 'p' which means any values that it would have loaded if not masked by 'p' would have been discarded in the add, so it has no relevant effect. Furthermore, I also believe the compiler is already aware that VCTP writes P0, given it has an input operand with the predicate 'vpr_register_operand' and the register constraint '=Up'. During DLSTP transformation we rely on reads and writes to such operands to do our transformation and it should also provide other backend passes with enough information. So I don't think this patch is needed.
On Thu, 28 Nov 2024 at 15:13, Andre Vieira (lists) <andre.simoesdiasvieira@arm.com> wrote: > > Hi Christophe, > > On 28/11/2024 10:22, Christophe Lyon wrote: > > The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based > > on the input value, but also constrained by a VPT block (if present), > > or if used within a DLSTP/LETP loop. > > > > Therefore we need to inform the compiler that this intrinsic reads the > > FPCXT register, otherwise it could make incorrect assumptions: for > > instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it > > would hoist p1 = vctp32q (g) outside of the loop. > > We chatted about this offlist but it's good to share here for others > too. I do not believe the transformation gcc is doing here is wrong. The > transformation we do for test 7, along with some others in the > testsuite, relies on analysis we do to check whether masks, that are not > the loop predicate mask, used within the loop have a side effect. In > other words, any instruction that is not predicated by the loop > predicate, be that unpredicated or predicated by another mask, triggers > an analysis to check whether the results are used in a safe way. Check > the comments above 'arm_mve_impl_predicated_p' in arm.cc > > For test7 the non-loop predicate 'p1' is used to predicate a load inside > the loop, when dlstp'ed that load will be masked by 'p & p1' instead, > which means it could be loading less than initially intended, however, > the results of that load are used in a vadd predicated by 'p' which > means any values that it would have loaded if not masked by 'p' would > have been discarded in the add, so it has no relevant effect. > > Furthermore, I also believe the compiler is already aware that VCTP > writes P0, given it has an input operand with the predicate > 'vpr_register_operand' and the register constraint '=Up'. During DLSTP > transformation we rely on reads and writes to such operands to do our > transformation and it should also provide other backend passes with > enough information. > > So I don't think this patch is needed. > Indeed. I managed to wrongly convince myself that p1 = vctp32q (g) should not be hoisted... Let me drop this patch, but dlstp-compile-asm-2.c still needs fixing. Thanks, Christophe
diff --git a/gcc/config/arm/arm-mve-builtins-base.cc b/gcc/config/arm/arm-mve-builtins-base.cc index 723004b53d7..bc9dcc77515 100644 --- a/gcc/config/arm/arm-mve-builtins-base.cc +++ b/gcc/config/arm/arm-mve-builtins-base.cc @@ -541,6 +541,12 @@ public: /* Mode this intrinsic operates on. */ machine_mode m_mode; + unsigned int + call_properties (const function_instance &) const override + { + return CP_READ_FPCXT; + } + rtx expand (function_expander &e) const override { diff --git a/gcc/config/arm/arm-mve-builtins.cc b/gcc/config/arm/arm-mve-builtins.cc index 30b103ec086..8bbcedd2f15 100644 --- a/gcc/config/arm/arm-mve-builtins.cc +++ b/gcc/config/arm/arm-mve-builtins.cc @@ -785,7 +785,8 @@ function_instance::reads_global_state_p () const if (flags & CP_READ_FPCR) return true; - return flags & CP_READ_MEMORY; + /* Handle direct reads of global state. */ + return flags & (CP_READ_MEMORY | CP_READ_FPCXT); } /* Return true if calls to the function could modify some form of diff --git a/gcc/config/arm/arm-mve-builtins.h b/gcc/config/arm/arm-mve-builtins.h index cdc07b4e51f..d76a10516ba 100644 --- a/gcc/config/arm/arm-mve-builtins.h +++ b/gcc/config/arm/arm-mve-builtins.h @@ -93,6 +93,7 @@ const unsigned int CP_READ_FPCR = 1U << 0; const unsigned int CP_RAISE_FP_EXCEPTIONS = 1U << 1; const unsigned int CP_READ_MEMORY = 1U << 2; const unsigned int CP_WRITE_MEMORY = 1U << 3; +const unsigned int CP_READ_FPCXT = 1U << 4; /* Enumerates the MVE predicate and (data) vector types, together called "vector types" for brevity. */