[v2] middle-end: Drop __builtin_prefetch calls in autovectorization [PR114061]

Message ID 20240611094538.1129822-1-victor.donascimento@arm.com
State New

Commit Message

Victor Do Nascimento June 11, 2024, 9:45 a.m. UTC
At present the autovectorizer fails to vectorize simple loops
involving calls to `__builtin_prefetch'.  A simple example of such a
loop is given below:

void foo(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));
  }
}
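
Before this change, the missed vectorization can be observed through
the vectorizer's diagnostics, along the following lines (the exact
flags and message wording may vary by GCC version and target):

$ gcc -O3 -fopt-info-vec-missed -c foo.c
foo.c:3:3: missed: couldn't vectorize loop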

The failure stems from two issues:

1. Given that it is typically not possible to fully reason about a
   function call due to the possibility of side effects, the
   autovectorizer does not attempt to vectorize loops which make such
   calls.

   In the absence of any assurance about the effect of
   `__builtin_prefetch' on the memory location passed to it, the
   compiler deems the call unsafe to vectorize and marks it as
   clobbering memory in `vect_find_stmt_data_reference'.  This leads
   to the failure in autovectorization.

2. Notwithstanding the above issue, even though the prefetch statement
   would be classed as `vect_unused_in_scope', the address of the
   prefetch is computed from the scalar loop's induction variable and
   not the vector loop's.  That is, it still uses `i' and not `vec_iv'
   because the statement was not vectorized, causing DCE to consider
   the value live, so that both the scalar and the vector induction
   variables end up actively used in the loop, as sketched below.
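
   Schematically, the transformed loop body would then look something
   like the following (illustrative pseudo-GIMPLE, not an actual
   dump):

     vec_iv = vec_iv + { 4, 4, 4, 4 };  /* Vector IV, drives the
                                           vectorized statements.  */
     ...
     __builtin_prefetch (&b[i + 8]);    /* Still addressed via `i'.  */
     i = i + 1;                         /* Scalar IV, kept live only
                                           by the prefetch.  */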

This patch addresses both of these:

1. Regarding the memory clobber, a data prefetch does not generate
   faults if its address argument is invalid and does not write to
   memory.  It therefore never alters the internal state of the
   program or its control flow.  As such, it is reasonable to mark
   the function as not affecting memory contents.

   To achieve this, we add the necessary logic to
   `get_references_in_stmt' to ensure that builtin functions are
   given the same treatment as internal functions.  If the gimple
   call is to a builtin function and its function code is
   `BUILT_IN_PREFETCH', we mark `clobbers_memory' as false.
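
   For reference, `__builtin_prefetch' accepts the address to
   prefetch plus two optional constant arguments, `rw' and
   `locality', neither of which gives it the ability to modify
   memory:

     __builtin_prefetch (addr);        /* Read, maximum locality.  */
     __builtin_prefetch (addr, 1);     /* Prefetch for a write.  */
     __builtin_prefetch (addr, 0, 1);  /* Read, low temporal locality.  */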

2. Finding precedent in the way clobber statements are handled,
   whereby the vectorizer drops these from both the scalar and
   vectorized versions of a given loop, we choose to drop prefetch
   hints in a similar fashion.  This seems appropriate given that
   software prefetch hints are typically ignored by processors across
   architectures, as they seldom lead to performance gains over their
   hardware counterparts.
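
   With both changes applied, the example above is expected to
   vectorize; one way to verify this (flags illustrative) is via the
   vectorizer dump:

     $ gcc -O3 -fdump-tree-vect-details -c foo.c
     $ grep "vectorized 1 loops" foo.c.*.vect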

   PR tree-optimization/114061

gcc/ChangeLog:

	* tree-data-ref.cc (get_references_in_stmt): Set
	`clobbers_memory' to false for __builtin_prefetch.
	* tree-vect-loop.cc (vect_transform_loop): Drop all
	__builtin_prefetch calls from loops.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-prefetch-drop.c: New test.
	* gcc.target/aarch64/vect-prefetch-drop.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c      | 12 ++++++++++++
 .../gcc.target/aarch64/vect-prefetch-drop.c         | 13 +++++++++++++
 gcc/tree-data-ref.cc                                |  2 ++
 gcc/tree-vect-loop.cc                               |  6 ++++--
 4 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-prefetch-drop.c

Comments

Richard Biener June 12, 2024, 12:48 p.m. UTC | #1
On Tue, Jun 11, 2024 at 11:46 AM Victor Do Nascimento
<victor.donascimento@arm.com> wrote:
> At present the autovectorizer fails to vectorize simple loops
> involving calls to `__builtin_prefetch'.
> [...]

OK.

Thanks,
Richard.

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c b/gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
new file mode 100644
index 00000000000..7a8915eb716
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-prefetch-drop.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+void foo(int * restrict a, int * restrict b, int n){
+  int i;
+  for(i=0; i<n; ++i){
+    a[i] = a[i] + b[i];
+    __builtin_prefetch(&(b[i+8]));
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-prefetch-drop.c b/gcc/testsuite/gcc.target/aarch64/vect-prefetch-drop.c
new file mode 100644
index 00000000000..e654b99fde8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-prefetch-drop.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { aarch64*-*-* } } } */
+/* { dg-additional-options "-O3 -march=armv9.2-a+sve --std=c99" { target { aarch64*-*-* } } } */
+
+void foo(double * restrict a, double * restrict b, int n){
+  int i;
+  for(i=0; i<n; ++i){
+    a[i] = a[i] + b[i];
+    __builtin_prefetch(&(b[i+8]));
+  }
+}
+
+/* { dg-final { scan-assembler-not "prfm" } } */
+/* { dg-final { scan-assembler "fadd\tz\[0-9\]+.d, p\[0-9\]+/m, z\[0-9\]+.d, z\[0-9\]+.d" } } */
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index 7b5f2d16238..bd61069b631 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -5868,6 +5868,8 @@  get_references_in_stmt (gimple *stmt, vec<data_ref_loc, va_heap> *references)
 	    clobbers_memory = true;
 	    break;
 	  }
+      else if (gimple_call_builtin_p (stmt, BUILT_IN_PREFETCH))
+	clobbers_memory = false;
       else
 	clobbers_memory = true;
     }
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index c471f1564a7..89cc6e64589 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12177,8 +12177,10 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 	   !gsi_end_p (si);)
 	{
 	  stmt = gsi_stmt (si);
-	  /* During vectorization remove existing clobber stmts.  */
-	  if (gimple_clobber_p (stmt))
+	  /* During vectorization remove existing clobber stmts and
+	     prefetches.  */
+	  if (gimple_clobber_p (stmt)
+	      || gimple_call_builtin_p (stmt, BUILT_IN_PREFETCH))
 	    {
 	      unlink_stmt_vdef (stmt);
 	      gsi_remove (&si, true);