diff mbox series

arm: [MVE intrinsics] fix vctpq intrinsic implementation [PR target/117814]

Message ID 20241128102204.2282699-1-christophe.lyon@linaro.org
State New
Headers show
Series arm: [MVE intrinsics] fix vctpq intrinsic implementation [PR target/117814] | expand

Commit Message

Christophe Lyon Nov. 28, 2024, 10:22 a.m. UTC
The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based
on the input value, but also constrained by a VPT block (if present),
or if used within a DLSTP/LETP loop.

Therefore we need to inform the compiler that this intrinsic reads the
FPCXT register, otherwise it could make incorrect assumptions: for
instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it
would hoist p1 = vctp32q (g) outside of the loop.

The patch introduces a new flag CP_READ_FPCXT, which is handled
similarly to CP_READ_MEMORY.

	gcc/ChangeLog:

	PR target/117814
	* config/arm/arm-mve-builtins-base.cc (vctpq_impl): Implement
	call_properties.
	* config/arm/arm-mve-builtins.cc
	(function_instance::reads_global_state_p): Handle CP_READ_FPCXT.
	* config/arm/arm-mve-builtins.h (CP_READ_FPCXT): New flag.
---
 gcc/config/arm/arm-mve-builtins-base.cc | 6 ++++++
 gcc/config/arm/arm-mve-builtins.cc      | 3 ++-
 gcc/config/arm/arm-mve-builtins.h       | 1 +
 3 files changed, 9 insertions(+), 1 deletion(-)

Comments

Andre Vieira (lists) Nov. 28, 2024, 2:13 p.m. UTC | #1
Hi Christophe,

On 28/11/2024 10:22, Christophe Lyon wrote:
> The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based
> on the input value, but also constrained by a VPT block (if present),
> or if used within a DLSTP/LETP loop.
> 
> Therefore we need to inform the compiler that this intrinsic reads the
> FPCXT register, otherwise it could make incorrect assumptions: for
> instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it
> would hoist p1 = vctp32q (g) outside of the loop.

We chatted about this offlist but it's good to share here for others 
too. I do not believe the transformation gcc is doing here is wrong. The 
transformation we do for test 7, along with some others in the 
testsuite, relies on analysis we do to check whether masks, that are not 
the loop predicate mask, used within the loop have a side effect. In 
other words, any instruction that is not predicated by the loop 
predicate, be that unpredicated or predicated by another mask, triggers 
an analysis to check whether the results are used in a safe way. Check 
the comments above 'arm_mve_impl_predicated_p' in arm.cc

For test7 the non-loop predicate 'p1' is used to predicate a load inside 
the loop, when dlstp'ed that load will be masked by 'p & p1' instead, 
which means it could be loading less than initially intended, however, 
the results of that load are used in a vadd predicated by 'p' which 
means any values that it would have loaded if not masked by 'p' would 
have been discarded in the add, so it has no relevant effect.

Furthermore, I also believe the compiler is already aware that VCTP 
writes P0, given it has an input operand with the predicate 
'vpr_register_operand' and the register constraint '=Up'. During DLSTP 
transformation we rely on reads and writes to such operands to do our 
transformation and it should also provide other backend passes with 
enough information.

So I don't think this patch is needed.
Christophe Lyon Nov. 28, 2024, 4:17 p.m. UTC | #2
On Thu, 28 Nov 2024 at 15:13, Andre Vieira (lists)
<andre.simoesdiasvieira@arm.com> wrote:
>
> Hi Christophe,
>
> On 28/11/2024 10:22, Christophe Lyon wrote:
> > The VCTP instruction creates a Vector Tail Predicate in VPR.P0, based
> > on the input value, but also constrained by a VPT block (if present),
> > or if used within a DLSTP/LETP loop.
> >
> > Therefore we need to inform the compiler that this intrinsic reads the
> > FPCXT register, otherwise it could make incorrect assumptions: for
> > instance in test7() from gcc.target/arm/mve/dlstp-compile-asm-2.c it
> > would hoist p1 = vctp32q (g) outside of the loop.
>
> We chatted about this offlist but it's good to share here for others
> too. I do not believe the transformation gcc is doing here is wrong. The
> transformation we do for test 7, along with some others in the
> testsuite, relies on analysis we do to check whether masks, that are not
> the loop predicate mask, used within the loop have a side effect. In
> other words, any instruction that is not predicated by the loop
> predicate, be that unpredicated or predicated by another mask, triggers
> an analysis to check whether the results are used in a safe way. Check
> the comments above 'arm_mve_impl_predicated_p' in arm.cc
>
> For test7 the non-loop predicate 'p1' is used to predicate a load inside
> the loop, when dlstp'ed that load will be masked by 'p & p1' instead,
> which means it could be loading less than initially intended, however,
> the results of that load are used in a vadd predicated by 'p' which
> means any values that it would have loaded if not masked by 'p' would
> have been discarded in the add, so it has no relevant effect.
>
> Furthermore, I also believe the compiler is already aware that VCTP
> writes P0, given it has an input operand with the predicate
> 'vpr_register_operand' and the register constraint '=Up'. During DLSTP
> transformation we rely on reads and writes to such operands to do our
> transformation and it should also provide other backend passes with
> enough information.
>
> So I don't think this patch is needed.
>
Indeed.
I managed to wrongly convince myself that p1 = vctp32q (g) should not
be hoisted...

Let me drop this patch, but dlstp-compile-asm-2.c still needs fixing.

Thanks,

Christophe
diff mbox series

Patch

diff --git a/gcc/config/arm/arm-mve-builtins-base.cc b/gcc/config/arm/arm-mve-builtins-base.cc
index 723004b53d7..bc9dcc77515 100644
--- a/gcc/config/arm/arm-mve-builtins-base.cc
+++ b/gcc/config/arm/arm-mve-builtins-base.cc
@@ -541,6 +541,12 @@  public:
   /* Mode this intrinsic operates on.  */
   machine_mode m_mode;
 
+  unsigned int
+  call_properties (const function_instance &) const override
+  {
+    return CP_READ_FPCXT;
+  }
+
   rtx
   expand (function_expander &e) const override
   {
diff --git a/gcc/config/arm/arm-mve-builtins.cc b/gcc/config/arm/arm-mve-builtins.cc
index 30b103ec086..8bbcedd2f15 100644
--- a/gcc/config/arm/arm-mve-builtins.cc
+++ b/gcc/config/arm/arm-mve-builtins.cc
@@ -785,7 +785,8 @@  function_instance::reads_global_state_p () const
   if (flags & CP_READ_FPCR)
     return true;
 
-  return flags & CP_READ_MEMORY;
+  /* Handle direct reads of global state.  */
+  return flags & (CP_READ_MEMORY | CP_READ_FPCXT);
 }
 
 /* Return true if calls to the function could modify some form of
diff --git a/gcc/config/arm/arm-mve-builtins.h b/gcc/config/arm/arm-mve-builtins.h
index cdc07b4e51f..d76a10516ba 100644
--- a/gcc/config/arm/arm-mve-builtins.h
+++ b/gcc/config/arm/arm-mve-builtins.h
@@ -93,6 +93,7 @@  const unsigned int CP_READ_FPCR = 1U << 0;
 const unsigned int CP_RAISE_FP_EXCEPTIONS = 1U << 1;
 const unsigned int CP_READ_MEMORY = 1U << 2;
 const unsigned int CP_WRITE_MEMORY = 1U << 3;
+const unsigned int CP_READ_FPCXT = 1U << 4;
 
 /* Enumerates the MVE predicate and (data) vector types, together called
    "vector types" for brevity.  */