
[RFC] test builtin ratio for loop distribution

Message ID orim7i8sei.fsf@lxoliva.fsfla.org
State New
Series: [RFC] test builtin ratio for loop distribution

Commit Message

Alexandre Oliva Jan. 27, 2021, 12:40 p.m. UTC
This patch attempts to fix a libgcc codegen regression introduced in
gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.


The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.
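
For illustration, the kind of loop at issue looks roughly like this (a
simplified sketch, not the exact soft-fp source):

  /* DST is word-aligned and N is at most 3 (zeroing trailing words);
     ldist converts this loop into memset (dst, 0, n * sizeof *dst).  */
  void
  zero_tail (unsigned long *dst, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = 0;
  }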

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones the options are a setmemM
pattern or a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch adds to the loop distribution pass some cost analysis based
on preexisting *_RATIO macros, so that we won't transform loops with
trip counts as low as the ratios we'd rather expand inline.


This patch is not finished; it needs adjustments to the testsuite, to
make up for the behavior changes it brings about.  Specifically, on an
x86_64-linux-gnu regstrap, it regresses:

> FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"

I suppose just lengthening the loops will take care of ldist-38 and
pr78847, but the loss of the warnings in pr53265 is more concerning, and
will require investigation.

Nevertheless, I seek feedback on whether this is an acceptable approach,
or whether we should use alternate tuning parameters for ldist, or
something entirely different.  Thanks in advance,


for  gcc/ChangeLog

	* tree-loop-distribution.c (maybe_normalize_partition): New.
	(loop_distribution::distribute_loop): Call it.

[requires testsuite adjustments and investigation of a warning regression]
---
 gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

Comments

Richard Biener Jan. 27, 2021, 3:12 p.m. UTC | #1
On Wed, Jan 27, 2021 at 2:18 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
>
> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones the options are a setmemM
> pattern or a function call.

There's also folding which is expected to turn small memcpy/memset
into simple stmts again.

> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.

But yes, this particular issue has come up before.  We do have some
on-the-side info for alignment and size estimates from profiling, notably
memset expansion will call

  if (currently_expanding_gimple_stmt)
    stringop_block_profile (currently_expanding_gimple_stmt,
                            &expected_align, &expected_size);

I'm not sure whether those would be a suitable vehicle to record
known alignments.  The alternative is to create builtin variants
that encode this info and are usable for the middle-end here.

That said, rather than not transforming the loop as you do I'd
say we want to re-inline small copies more forcefully during
loop distribution code-gen so we turn a loop that sets
3 'short int' to zero into an 'int' store and a 'short' store for example
(we have more code-size leeway here because we formerly had
a loop).
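
Concretely, the suggested re-inlining would amount to something like the
following sketch (assuming the destination is known to be 4-byte aligned;
the pass would emit the wider store directly in GIMPLE rather than via a
source-level cast):

  typedef int __attribute__ ((may_alias)) aliasing_int;

  void
  clear3 (short *p)
  {
    /* Instead of memset (p, 0, 3 * sizeof (short)):  */
    *(aliasing_int *) p = 0;   /* one 4-byte store covers p[0] and p[1] */
    p[2] = 0;                  /* one 2-byte store for the tail */
  }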

Since you don't add a testcase I can't see whether the actual
case would be fixed by setting SSA pointer alignment on the
memset arguments (that is, whether they are invariant and thus
no SSA name is involved or not).

> This patch adds to the loop distribution pass some cost analysis based
> on preexisting *_RATIO macros, so that we won't transform loops with
> trip counts as low as the ratios we'd rather expand inline.
>
>
> This patch is not finished; it needs adjustments to the testsuite, to
> make up for the behavior changes it brings about.  Specifically, on an
> x86_64-linux-gnu regstrap, it regresses:
>
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> > FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"
>
> I suppose just lengthening the loops will take care of ldist-38 and
> pr78847, but the loss of the warnings in pr53265 is more concerning, and
> will require investigation.
>
> Nevertheless, I seek feedback on whether this is an acceptable approach,
> or whether we should use alternate tuning parameters for ldist, or
> something entirely different.  Thanks in advance,
>
>
> for  gcc/ChangeLog
>
>         * tree-loop-distribution.c (maybe_normalize_partition): New.
>         (loop_distribution::distribute_loop): Call it.
>
> [requires testsuite adjustments and investigation of a warning regression]
> ---
>  gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..b5198652817ee 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
>      }
>  }
>
> +/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
> +   call, and true if it wasn't, changing the PARTITION to PKIND_NORMAL.  */
> +
> +static bool
> +maybe_normalize_partition (class loop *loop, struct partition *partition)
> +{
> +  unsigned HOST_WIDE_INT ratio;
> +
> +  switch (partition->kind)
> +    {
> +    case PKIND_NORMAL:
> +    case PKIND_PARTIAL_MEMSET:
> +      return false;
> +
> +    case PKIND_MEMSET:
> +      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
> +                                            (partition->builtin->dst_dr))))
> +       ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
> +      else
> +       ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    case PKIND_MEMCPY:
> +    case PKIND_MEMMOVE:
> +      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  tree niters = number_of_latch_executions (loop);
> +  if (niters == NULL_TREE || niters == chrec_dont_know)
> +    return false;
> +
> +  wide_int minit, maxit;
> +  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
> +  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
> +    {
> +      partition->kind = PKIND_NORMAL;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
>  void
>  loop_distribution::finalize_partitions (class loop *loop,
>                                         vec<struct partition *> *partitions,
> @@ -3087,6 +3133,14 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
>      }
>
>    finalize_partitions (loop, &partitions, &alias_ddrs);
> +  {
> +    bool any_changes_p = false;
> +    for (i = 0; partitions.iterate (i, &partition); ++i)
> +      if (maybe_normalize_partition (loop, partition))
> +       any_changes_p = true;
> +    if (any_changes_p)
> +      finalize_partitions (loop, &partitions, &alias_ddrs);
> +  }
>
>    /* If there is a reduction in all partitions make sure the last one
>       is not classified for builtin code generation.  */
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
Alexandre Oliva Jan. 28, 2021, 5:28 a.m. UTC | #2
On Jan 27, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> That said, rather than not transforming the loop as you do I'd
> say we want to re-inline small copies more forcefully during
> loop distribution code-gen so we turn a loop that sets
> 3 'short int' to zero into an 'int' store and a 'short' store for example
> (we have more code-size leeway here because we formerly had
> a loop).

Ok, that makes sense to me, if there's a chance of growing the access
stride.

> Since you don't add a testcase

Uhh, sorry, I mentioned TFmode emulation calls, but I wasn't explicit I
meant the soft-fp ones from libgcc.

./xgcc -B./ -O2 $srcdir/libgcc/soft-fp/fixtfdi.c \
  -I$srcdir/libgcc/config/riscv -I$srcdir/include \
  -S -o - | grep memset

> I can't see whether the actual case would be fixed by setting SSA
> pointer alignment on the memset arguments

The dest pointer alignment is known at the builtin expansion time,
that's not a problem.  What isn't known (*) is that the length is a
multiple of the word size: what gets to the expander is that it's
between 4 and 12 bytes (targeting 32-bit risc-v), but not that it's
either 4, 8 or 12.  Coming up with an efficient inline expansion becomes
a bit of a challenge without that extra knowledge.


(*) at least before my related patch for get_nonzero_bits
https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564344.html
Richard Biener Jan. 28, 2021, 8:59 a.m. UTC | #3
On Thu, Jan 28, 2021 at 6:28 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Jan 27, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > That said, rather than not transforming the loop as you do I'd
> > say we want to re-inline small copies more forcefully during
> > loop distribution code-gen so we turn a loop that sets
> > 3 'short int' to zero into an 'int' store and a 'short' store for example
> > (we have more code-size leeway here because we formerly had
> > a loop).
>
> Ok, that makes sense to me, if there's a chance of growing the access
> stride.
>
> > Since you don't add a testcase
>
> Uhh, sorry, I mentioned TFmode emulation calls, but I wasn't explicit I
> meant the soft-fp ones from libgcc.
>
> ./xgcc -B./ -O2 $srcdir/libgcc/soft-fp/fixtfdi.c \
>   -I$srcdir/libgcc/config/riscv -I$srcdir/include \
>   -S -o - | grep memset
>
> > I can't see whether the actual case would be fixed by setting SSA
> > pointer alignment on the memset arguments
>
> The dest pointer alignment is known at the builtin expansion time,
> that's not a problem.  What isn't known (*) is that the length is a
> multiple of the word size: what gets to the expander is that it's
> between 4 and 12 bytes (targeting 32-bit risc-v), but not that it's
> either 4, 8 or 12.  Coming up with an efficient inline expansion becomes
> a bit of a challenge without that extra knowledge.

Ah, yes - that's also useful information.  I guess for memset
an enhanced builtin would then have calloc style
size and nmemb arguments where the destination is expected
to have at least 'size' alignment.  That would allow turning
back the memset into the original loop (but with optimal IVs, etc.).
So kind of the x86 rep mov{bwlq}.  Now think of sth similarly
useful for memcpy/move.
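
Such a variant might look roughly like the following declaration (purely
hypothetical, no such builtin exists today):

  #include <stddef.h>

  /* Hypothetical calloc-style variant: set NMEMB elements of SIZE bytes
     each at DST, with DST assumed to be at least SIZE-aligned, so the
     expander can use SIZE-wide accesses.  */
  void *__builtin_memset_array (void *dst, int val, size_t size, size_t nmemb);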

Richard.

>
>
> (*) at least before my related patch for get_nonzero_bits
> https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564344.html
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
Alexandre Oliva Feb. 2, 2021, 5:13 p.m. UTC | #4
On Jan 28, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> That would allow turning back the memset into the original loop (but
> with optimal IVs, etc.).

Is this sort of what you had in mind?

I haven't tested the inline expansion of memset much yet; and that of
memcpy, not at all; this really is mainly to check that I understood
what you had in mind before I spend further time polishing it.


test builtin ratio for loop distribution

From: Alexandre Oliva <oliva@adacore.com>

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones the options are a setmemM
pattern or a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch adds, to the loop distribution pass, the ability to perform
inline expansion of bounded variable-length memset and memcpy in ways
that take advantage of known alignments and known factors of the size, when
preexisting *_RATIO macros suggest the inline expansion is
advantageous.


for  gcc/ChangeLog

	* tree-loop-distribution.c: Include builtins.h.
	(generate_memset_builtin): Introduce optional inline expansion
	of bounded variable-sized memset calls.
	(generate_memcpy_builtin): Likewise for memcpy only.
	(loop_distribution::execute): Fix loop structure afterwards.
---
 gcc/tree-loop-distribution.c |  280 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 279 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..3be7a4c1ac281 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "tree-eh.h"
 #include "gimple-fold.h"
+#include "builtins.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -1148,6 +1149,23 @@ generate_memset_builtin (class loop *loop, partition *partition)
   /* The new statements will be placed before LOOP.  */
   gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
 
+  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
+     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
+     which we don't have VR info.  */
+  unsigned align = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
+  unsigned alctz, szctz, xbits;
+  wide_int szmin, szmax;
+  value_range_kind vrk;
+  if (align)
+    {
+      alctz = wi::ctz (align);
+      szctz = MIN (tree_ctz (builtin->size), alctz);
+      xbits = alctz - szctz;
+      vrk = determine_value_range (builtin->size, &szmin, &szmax);
+      if (szmin == szmax)
+	align = 0;
+    }
+
   nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
 				       false, GSI_CONTINUE_LINKING);
@@ -1172,6 +1190,127 @@ generate_memset_builtin (class loop *loop, partition *partition)
       val = tem;
     }
 
+  unsigned HOST_WIDE_INT ratio;
+  if (integer_zerop (val))
+    ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+  else
+    ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+
+  /* Compare the ratio with the number of (most-aligned) required stores needed
+     for szmax bytes, with the ratio: a loop that iterates up to szmax >> alctz,
+     and then one conditional store for each extra bit of alignment that the
+     destination has over the size.  */
+  if (align && vrk == VR_RANGE
+      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
+    {
+      gimple *stmt;
+      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
+      tree ptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistptr");
+      tree stval = force_gimple_operand_gsi (&gsi,
+					     fold_convert
+					     (unsigned_char_type_node, val),
+					     true, NULL_TREE, false,
+					     GSI_CONTINUE_LINKING);
+
+      for (unsigned i = 1; i != alctz; i *= 2)
+	{
+	  unsigned bits = i * 2 * BITS_PER_UNIT;
+	  tree type = build_nonstandard_integer_type (bits, true);
+	  stval = fold_convert (type, stval);
+	  tree op2 = fold_build2_loc (partition->loc, LSHIFT_EXPR,
+				      TREE_TYPE (stval), stval,
+				      build_int_cst (type, i * BITS_PER_UNIT));
+	  stval = fold_build2_loc (partition->loc, BIT_IOR_EXPR,
+				   TREE_TYPE (stval), stval, op2);
+	  stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
+					    false, GSI_CONTINUE_LINKING);
+	}
+
+      stmt = gimple_build_assign (bytes, nb_bytes);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      stmt = gimple_build_assign (ptr, mem);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
+
+      for (unsigned i = 0; i <= xbits; i++)
+	{
+	  tree blksz = build_int_cst (size_type_node, align >> i);
+
+	  if (i > 0)
+	    {
+	      unsigned bits = (align >> i) * BITS_PER_UNIT;
+	      tree type = build_nonstandard_integer_type (bits, true);
+	      stval = fold_convert (type, stval);
+	      stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
+						false, GSI_CONTINUE_LINKING);
+	    }
+
+	  tree labcond = NULL; // create_artificial_label (partition->loc);
+	  tree labskip = NULL; // create_artificial_label (partition->loc);
+
+	  stmt = gimple_build_cond (GE_EXPR, bytes, blksz, labcond, labskip);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  basic_block bbfrom = gsi_bb (gsi);
+	  basic_block bbcond = split_block (bbfrom, stmt)->dest;
+
+	  gsi = gsi_start_bb (bbcond);
+
+	  stmt = gimple_build_assign (fold_build2
+				      (MEM_REF, TREE_TYPE (stval), ptr,
+				       build_int_cst (TREE_TYPE (ptr), 0)),
+				      stval);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  if (i == 0 || i < xbits)
+	    {
+	      stmt = gimple_build_assign (ptr,
+					  fold_build2_loc
+					  (partition->loc,
+					   POINTER_PLUS_EXPR,
+					   TREE_TYPE (ptr),
+					   ptr, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	      stmt = gimple_build_assign (bytes,
+					  fold_build2_loc
+					  (partition->loc,
+					   MINUS_EXPR,
+					   TREE_TYPE (bytes),
+					   bytes, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	    }
+
+	  basic_block bbskip = split_block (bbcond, stmt)->dest;
+
+	  if (i == 0)
+	    redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
+
+	  single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
+	  single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
+	  make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
+	  set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "generated memset, inlined with %u-trip loop ctzs %u..%u \n",
+		 unsigned((wi::rshift (szmax, alctz, UNSIGNED)
+			   + xbits).to_uhwi ()),
+		 alctz, szctz);
+
+      return;
+    }
+
   fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_MEMSET));
   fn_call = gimple_build_call (fn, 3, mem, val, nb_bytes);
   gimple_set_location (fn_call, partition->loc);
@@ -1202,6 +1341,27 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
   /* The new statements will be placed before LOOP.  */
   gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
 
+  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
+     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
+     which we don't have VR info.  */
+  unsigned align = (MIN (get_pointer_alignment (builtin->dst_base),
+			 get_pointer_alignment (builtin->src_base))
+		    / BITS_PER_UNIT);
+  unsigned alctz, szctz, xbits;
+  wide_int szmin, szmax;
+  value_range_kind vrk;
+  if (align)
+    {
+      alctz = wi::ctz (align);
+      szctz = MIN (tree_ctz (builtin->size), alctz);
+      xbits = alctz - szctz;
+      vrk = determine_value_range (builtin->size, &szmin, &szmax);
+      /* A call with constant size will be expanded into an unrolled loop
+	 later.	 */
+      if (szmin == szmax)
+	align = 0;
+    }
+
   nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
 				       false, GSI_CONTINUE_LINKING);
@@ -1211,12 +1371,127 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
       || ! ptr_derefs_may_alias_p (dest, src))
     kind = BUILT_IN_MEMCPY;
   else
-    kind = BUILT_IN_MEMMOVE;
+    {
+      kind = BUILT_IN_MEMMOVE;
+      /* The inlined loop won't handle overlapping memory areas.  */
+      align = 0;
+    }
 
   dest = force_gimple_operand_gsi (&gsi, dest, true, NULL_TREE,
 				   false, GSI_CONTINUE_LINKING);
   src = force_gimple_operand_gsi (&gsi, src, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
+
+  unsigned HOST_WIDE_INT ratio;
+  ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
+
+  /* Compare the ratio with the number of (most-aligned) required stores needed
+     for szmax bytes, with the ratio: a loop that iterates up to szmax >> alctz,
+     and then one conditional store for each extra bit of alignment that the
+     destination has over the size.  */
+  if (align && vrk == VR_RANGE
+      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
+    {
+      gimple *stmt;
+      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
+      tree dptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistdptr");
+      tree sptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistsptr");
+
+      stmt = gimple_build_assign (bytes, nb_bytes);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      stmt = gimple_build_assign (dptr, dest);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      stmt = gimple_build_assign (sptr, src);
+      gimple_set_location (stmt, partition->loc);
+      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
+
+      for (unsigned i = 0; i <= xbits; i++)
+	{
+	  tree blksz = build_int_cst (size_type_node, align >> i);
+	  unsigned bits = (align >> i) * BITS_PER_UNIT;
+	  tree type = build_nonstandard_integer_type (bits, true);
+	  tree val = make_ssa_name (type);
+
+	  stmt = gimple_build_cond (GE_EXPR, bytes, blksz, NULL, NULL);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  basic_block bbfrom = gsi_bb (gsi);
+	  basic_block bbcond = split_block (bbfrom, stmt)->dest;
+
+	  gsi = gsi_start_bb (bbcond);
+	  stmt = gimple_build_assign (val,
+				      fold_build2
+				      (MEM_REF, TREE_TYPE (val), sptr,
+				       build_int_cst (TREE_TYPE (sptr), 0)));
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  stmt = gimple_build_assign (fold_build2
+				      (MEM_REF, TREE_TYPE (val), dptr,
+				       build_int_cst (TREE_TYPE (dptr), 0)),
+				      val);
+	  gimple_set_location (stmt, partition->loc);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  if (i == 0 || i < xbits)
+	    {
+	      stmt = gimple_build_assign (sptr,
+					  fold_build2_loc
+					  (partition->loc,
+					   POINTER_PLUS_EXPR,
+					   TREE_TYPE (sptr),
+					   sptr, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	      stmt = gimple_build_assign (dptr,
+					  fold_build2_loc
+					  (partition->loc,
+					   POINTER_PLUS_EXPR,
+					   TREE_TYPE (dptr),
+					   dptr, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	      stmt = gimple_build_assign (bytes,
+					  fold_build2_loc
+					  (partition->loc,
+					   MINUS_EXPR,
+					   TREE_TYPE (bytes),
+					   bytes, blksz));
+	      gimple_set_location (stmt, partition->loc);
+	      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	    }
+
+	  basic_block bbskip = split_block (bbcond, stmt)->dest;
+	  if (i == 0)
+	    redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
+
+	  single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
+	  single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
+	  make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
+	  set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "generated memcpy, inlined with %u-trip loop ctzs %u..%u \n",
+		 unsigned((wi::rshift (szmax, alctz, UNSIGNED)
+			   + xbits).to_uhwi ()),
+		 alctz, szctz);
+
+      return;
+    }
+
   fn = build_fold_addr_expr (builtin_decl_implicit (kind));
   fn_call = gimple_build_call (fn, 3, dest, src, nb_bytes);
   gimple_set_location (fn_call, partition->loc);
@@ -3357,6 +3632,9 @@ loop_distribution::execute (function *fun)
       scev_reset_htab ();
       mark_virtual_operands_for_renaming (fun);
       rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+      /* Newly-added loops, for inline expansions of memset/memcpy, need to be
+	 integrated.  */
+      fix_loop_structure (NULL);
     }
 
   checking_verify_loop_structure ();
Richard Biener Feb. 3, 2021, 8:59 a.m. UTC | #5
On Tue, Feb 2, 2021 at 6:14 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Jan 28, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > That would allow turning back the memset into the original loop (but
> > with optimal IVs, etc.).
>
> Is this sort of what you had in mind?
>
> I haven't tested the inline expansion of memset much yet; and that of
> memcpy, not at all; this really is mainly to check that I understood
> what you had in mind before I spend further time polishing it.

So I think we should try to match what __builtin_memcpy/memset
expansion would do here, taking advantage of extra alignment
and size knowledge.  In particular,

 a) if __builtin_memcpy/memset would use setmem/cpymem optabs
     see if we can have variants of memcpy/memset transferring alignment
     and size knowledge

 b) if expansion would use BY_PIECES then expand to an unrolled loop

 c) if expansion would emit a memset/memcpy call but we know
     alignment and have a low bound on niters emit a loop (like your patch does)

a) might be difficult but adding the builtin variants may pay off in any case.

The patch itself could benefit from one or two helpers we already
have, first of all there's create_empty_loop_on_edge (so you don't
need the loop fixup) which conveniently adds the control IV for you.
If you want to use pointer IVs for simplicity there's create_iv.  In the
end you shouldn't need more than creating the actual memory GIMPLE
stmts.  If expansion would use BY_PIECES you can implement the
unrolled code by setting loop->unroll to the number of iterations
to (maximally) peel and cunroll will take care of that.

Note that for memmove if we know the dependence direction, we
can also emit a loop / unrolled code.

I think the builtins with alignment and calloc-style element count
will be useful on its own.

Richard.

>
> test builtin ratio for loop distribution
>
> From: Alexandre Oliva <oliva@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones the options are a setmemM
> pattern or a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch adds, to the loop distribution pass, the ability to perform
> inline expansion of bounded variable-length memset and memcpy in ways
> that take advantage of known alignments and known factors of the size, when
> preexisting *_RATIO macros suggest the inline expansion is
> advantageous.
>
>
> for  gcc/ChangeLog
>
>         * tree-loop-distribution.c: Include builtins.h.
>         (generate_memset_builtin): Introduce optional inline expansion
>         of bounded variable-sized memset calls.
>         (generate_memcpy_builtin): Likewise for memcpy only.
>         (loop_distribution::execute): Fix loop structure afterwards.
> ---
>  gcc/tree-loop-distribution.c |  280 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 279 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..3be7a4c1ac281 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "builtins.h"
>
>
>  #define MAX_DATAREFS_NUM \
> @@ -1148,6 +1149,23 @@ generate_memset_builtin (class loop *loop, partition *partition)
>    /* The new statements will be placed before LOOP.  */
>    gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
>
> +  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
> +     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
> +     which we don't have VR info.  */
> +  unsigned align = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
> +  unsigned alctz, szctz, xbits;
> +  wide_int szmin, szmax;
> +  value_range_kind vrk;
> +  if (align)
> +    {
> +      alctz = wi::ctz (align);
> +      szctz = MIN (tree_ctz (builtin->size), alctz);
> +      xbits = alctz - szctz;
> +      vrk = determine_value_range (builtin->size, &szmin, &szmax);
> +      if (szmin == szmax)
> +       align = 0;
> +    }
> +
>    nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
>    nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
>                                        false, GSI_CONTINUE_LINKING);
> @@ -1172,6 +1190,127 @@ generate_memset_builtin (class loop *loop, partition *partition)
>        val = tem;
>      }
>
> +  unsigned HOST_WIDE_INT ratio;
> +  if (integer_zerop (val))
> +    ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
> +  else
> +    ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
> +
> +  /* Compare the ratio with the number of (most-aligned) required stores needed
> +     for szmax bytes, with the ratio: a loop that iterates up to szmax >> alctz,
> +     and then one conditional store for each extra bit of alignment that the
> +     destination has over the size.  */
> +  if (align && vrk == VR_RANGE
> +      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
> +    {
> +      gimple *stmt;
> +      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
> +      tree ptr = create_tmp_var_raw (build_pointer_type (char_type_node),
> +                                    "ldistptr");
> +      tree stval = force_gimple_operand_gsi (&gsi,
> +                                            fold_convert
> +                                            (unsigned_char_type_node, val),
> +                                            true, NULL_TREE, false,
> +                                            GSI_CONTINUE_LINKING);
> +
> +      for (unsigned i = 1; i != alctz; i *= 2)
> +       {
> +         unsigned bits = i * 2 * BITS_PER_UNIT;
> +         tree type = build_nonstandard_integer_type (bits, true);
> +         stval = fold_convert (type, stval);
> +         tree op2 = fold_build2_loc (partition->loc, LSHIFT_EXPR,
> +                                     TREE_TYPE (stval), stval,
> +                                     build_int_cst (type, i * BITS_PER_UNIT));
> +         stval = fold_build2_loc (partition->loc, BIT_IOR_EXPR,
> +                                  TREE_TYPE (stval), stval, op2);
> +         stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
> +                                           false, GSI_CONTINUE_LINKING);
> +       }
> +
> +      stmt = gimple_build_assign (bytes, nb_bytes);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      stmt = gimple_build_assign (ptr, mem);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
> +
> +      for (unsigned i = 0; i <= xbits; i++)
> +       {
> +         tree blksz = build_int_cst (size_type_node, align >> i);
> +
> +         if (i > 0)
> +           {
> +             unsigned bits = (align >> i) * BITS_PER_UNIT;
> +             tree type = build_nonstandard_integer_type (bits, true);
> +             stval = fold_convert (type, stval);
> +             stval = force_gimple_operand_gsi (&gsi, stval, true, NULL_TREE,
> +                                               false, GSI_CONTINUE_LINKING);
> +           }
> +
> +         tree labcond = NULL; // create_artificial_label (partition->loc);
> +         tree labskip = NULL; // create_artificial_label (partition->loc);
> +
> +         stmt = gimple_build_cond (GE_EXPR, bytes, blksz, labcond, labskip);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         basic_block bbfrom = gsi_bb (gsi);
> +         basic_block bbcond = split_block (bbfrom, stmt)->dest;
> +
> +         gsi = gsi_start_bb (bbcond);
> +
> +         stmt = gimple_build_assign (fold_build2
> +                                     (MEM_REF, TREE_TYPE (stval), ptr,
> +                                      build_int_cst (TREE_TYPE (ptr), 0)),
> +                                     stval);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         if (i == 0 || i < xbits)
> +           {
> +             stmt = gimple_build_assign (ptr,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          POINTER_PLUS_EXPR,
> +                                          TREE_TYPE (ptr),
> +                                          ptr, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +             stmt = gimple_build_assign (bytes,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          MINUS_EXPR,
> +                                          TREE_TYPE (bytes),
> +                                          bytes, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +           }
> +
> +         basic_block bbskip = split_block (bbcond, stmt)->dest;
> +
> +         if (i == 0)
> +           redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
> +
> +         single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
> +         single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
> +         make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
> +         set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
> +       }
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "generated memset, inlined with %u-trip loop ctzs %u..%u \n",
> +                unsigned((wi::rshift (szmax, alctz, UNSIGNED)
> +                          + xbits).to_uhwi ()),
> +                alctz, szctz);
> +
> +      return;
> +    }
> +
>    fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_MEMSET));
>    fn_call = gimple_build_call (fn, 3, mem, val, nb_bytes);
>    gimple_set_location (fn_call, partition->loc);
> @@ -1202,6 +1341,27 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
>    /* The new statements will be placed before LOOP.  */
>    gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
>
> +  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
> +     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
> +     which we don't have VR info.  */
> +  unsigned align = (MIN (get_pointer_alignment (builtin->dst_base),
> +                        get_pointer_alignment (builtin->src_base))
> +                   / BITS_PER_UNIT);
> +  unsigned alctz, szctz, xbits;
> +  wide_int szmin, szmax;
> +  value_range_kind vrk;
> +  if (align)
> +    {
> +      alctz = wi::ctz (align);
> +      szctz = MIN (tree_ctz (builtin->size), alctz);
> +      xbits = alctz - szctz;
> +      vrk = determine_value_range (builtin->size, &szmin, &szmax);
> +      /* A call with constant size will be expanded into an unrolled loop
> +        later.  */
> +      if (szmin == szmax)
> +       align = 0;
> +    }
> +
>    nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
>    nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
>                                        false, GSI_CONTINUE_LINKING);
> @@ -1211,12 +1371,127 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
>        || ! ptr_derefs_may_alias_p (dest, src))
>      kind = BUILT_IN_MEMCPY;
>    else
> -    kind = BUILT_IN_MEMMOVE;
> +    {
> +      kind = BUILT_IN_MEMMOVE;
> +      /* The inlined loop won't handle overlapping memory areas.  */
> +      align = 0;
> +    }
>
>    dest = force_gimple_operand_gsi (&gsi, dest, true, NULL_TREE,
>                                    false, GSI_CONTINUE_LINKING);
>    src = force_gimple_operand_gsi (&gsi, src, true, NULL_TREE,
>                                   false, GSI_CONTINUE_LINKING);
> +
> +  unsigned HOST_WIDE_INT ratio;
> +  ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
> +
> +  /* Compare the ratio with the number of (most-aligned) required stores needed
> +     for szmax bytes, with the ratio: a loop that iterates up to szmax >> alctz,
> +     and then one conditional store for each extra bit of alignment that the
> +     destination has over the size.  */
> +  if (align && vrk == VR_RANGE
> +      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
> +    {
> +      gimple *stmt;
> +      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
> +      tree dptr = create_tmp_var_raw (build_pointer_type (char_type_node),
> +                                    "ldistdptr");
> +      tree sptr = create_tmp_var_raw (build_pointer_type (char_type_node),
> +                                    "ldistsptr");
> +
> +      stmt = gimple_build_assign (bytes, nb_bytes);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      stmt = gimple_build_assign (dptr, dest);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      stmt = gimple_build_assign (sptr, src);
> +      gimple_set_location (stmt, partition->loc);
> +      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +      gsi = gsi_start_bb (split_block (gsi_bb (gsi), stmt)->dest);
> +
> +      for (unsigned i = 0; i <= xbits; i++)
> +       {
> +         tree blksz = build_int_cst (size_type_node, align >> i);
> +         unsigned bits = (align >> i) * BITS_PER_UNIT;
> +         tree type = build_nonstandard_integer_type (bits, true);
> +         tree val = make_ssa_name (type);
> +
> +         stmt = gimple_build_cond (GE_EXPR, bytes, blksz, NULL, NULL);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         basic_block bbfrom = gsi_bb (gsi);
> +         basic_block bbcond = split_block (bbfrom, stmt)->dest;
> +
> +         gsi = gsi_start_bb (bbcond);
> +         stmt = gimple_build_assign (val,
> +                                     fold_build2
> +                                     (MEM_REF, TREE_TYPE (val), sptr,
> +                                      build_int_cst (TREE_TYPE (sptr), 0)));
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         stmt = gimple_build_assign (fold_build2
> +                                     (MEM_REF, TREE_TYPE (val), dptr,
> +                                      build_int_cst (TREE_TYPE (dptr), 0)),
> +                                     val);
> +         gimple_set_location (stmt, partition->loc);
> +         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +         if (i == 0 || i < xbits)
> +           {
> +             stmt = gimple_build_assign (sptr,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          POINTER_PLUS_EXPR,
> +                                          TREE_TYPE (sptr),
> +                                          sptr, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +             stmt = gimple_build_assign (dptr,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          POINTER_PLUS_EXPR,
> +                                          TREE_TYPE (dptr),
> +                                          dptr, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +
> +             stmt = gimple_build_assign (bytes,
> +                                         fold_build2_loc
> +                                         (partition->loc,
> +                                          MINUS_EXPR,
> +                                          TREE_TYPE (bytes),
> +                                          bytes, blksz));
> +             gimple_set_location (stmt, partition->loc);
> +             gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
> +           }
> +
> +         basic_block bbskip = split_block (bbcond, stmt)->dest;
> +         if (i == 0)
> +           redirect_edge_succ (single_succ_edge (bbcond), bbfrom);
> +
> +         single_succ_edge (bbfrom)->flags |= EDGE_TRUE_VALUE;
> +         single_succ_edge (bbfrom)->flags &= ~EDGE_FALLTHRU;
> +         make_edge (bbfrom, bbskip, EDGE_FALSE_VALUE);
> +         set_immediate_dominator (CDI_DOMINATORS, bbskip, bbfrom);
> +       }
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file,
> +                "generated memcpy, inlined with %u-trip loop ctzs %u..%u \n",
> +                unsigned((wi::rshift (szmax, alctz, UNSIGNED)
> +                          + xbits).to_uhwi ()),
> +                alctz, szctz);
> +
> +      return;
> +    }
> +
>    fn = build_fold_addr_expr (builtin_decl_implicit (kind));
>    fn_call = gimple_build_call (fn, 3, dest, src, nb_bytes);
>    gimple_set_location (fn_call, partition->loc);
> @@ -3357,6 +3632,9 @@ loop_distribution::execute (function *fun)
>        scev_reset_htab ();
>        mark_virtual_operands_for_renaming (fun);
>        rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
> +      /* Newly-added loops, for inline expansions of memset/memcpy, need to be
> +        integrated.  */
> +      fix_loop_structure (NULL);
>      }
>
>    checking_verify_loop_structure ();
>
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
Alexandre Oliva Feb. 3, 2021, 3:11 p.m. UTC | #6
On Feb  3, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

> So I think we should try to match what __builtin_memcpy/memset
> expansion would do here, taking advantage of extra alignment
> and size knowledge.  In particular,

>  a) if __builtin_memcpy/memset would use setmem/cpymem optabs
>      see if we can have variants of memcpy/memset transferring alignment
>      and size knowledge

We could add more optional parameters to them to convey the length's
known ctz.  However, the ctz can't be recovered reliably.  We can't even
recover it after gimplifying the length within ldist!

That said, my other patch already enables ctz calls to recover it, at
least in libgcc risc-v tfmode cases, and it's possible it's readily
available in other cases.  I'd rather leave that for someone dealing
with the machine-specific patterns to figure out whether a separate
argument would be useful.  RISC-V, which is what I'm dealing with,
doesn't have much to offer as far as these patterns are concerned.

>  b) if expansion would use BY_PIECES then expand to an unrolled loop

Why would that be better than keeping the constant-length memset call,
that would be turned into an unrolled loop during expand?

>  c) if expansion would emit a memset/memcpy call but we know
>      alignment and have a low bound on niters emit a loop (like your patch does)

> a) might be difficult but adding the builtin variants may pay off in any case.

*nod*

> The patch itself could benefit from one or two helpers we already
> have, first of all there's create_empty_loop_on_edge (so you don't
> need the loop fixup)

Uhh, thanks, but...  you realize nearly all of the gimple-building code
is one and the same for the loop and for trailing count misalignment?
There doesn't seem to be a lot of benefit to using this primitive, aside
from its dealing with creating the loop data structure which, again, I'd
only want to do in the loop case.

(I guess I should add more comments as to the inline expansion
 strategy.  it's equivalent to:

 i = len, ptr = base, blksz = 1 << alctz;
 while (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
 blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
 blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
 ... until blksz gets down to zero or to 1<<szctz)
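
A C rendering of that strategy, with example values alctz == 3 and
szctz == 2 (an 8-byte-aligned destination and a length that is a
multiple of 4; the __builtin_memcpy calls stand in for the aligned
stores the expansion emits directly):

  #include <stddef.h>

  /* Rendering of the strategy above for alctz == 3, szctz == 2: BASE is
     8-byte aligned and LEN is a multiple of 4 (illustrative values, not
     taken from the patch).  */
  void
  inline_memset_sketch (char *base, unsigned char val, size_t len)
  {
    unsigned long long v = val * 0x0101010101010101ULL;  /* splat VAL */
    size_t i = len;
    char *ptr = base;
    while (i >= 8)                     /* blksz == 1 << alctz */
      {
        __builtin_memcpy (ptr, &v, 8); /* one aligned 8-byte store */
        i -= 8;
        ptr += 8;
      }
    if (i >= 4)                        /* blksz halved, now 1 << szctz */
      {
        __builtin_memcpy (ptr, &v, 4); /* one aligned 4-byte store */
        i -= 4;
        ptr += 4;
      }
    /* I is now zero, since LEN was a multiple of 1 << szctz.  */
  }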
 
> Note that for memmove if we know the dependence direction, we
> can also emit a loop / unrolled code.

*nod*, but the logic would have to be quite different, using bit tests,
and odds are we won't know the direction and have to output a test and
code for both possibilities, which would be quite unlikely to be
beneficial.  Though the original code would quite likely make the
direction visible; perhaps if the size is small enough that we would
expand a memcpy inline, we should refrain from transforming the loop
into a memmove call.

In the case at hand, there's no benefit at all to these transformations:
we start from a loop with the known alignment and a small loop count (0
to 2 words copied), and if everything goes all right with the
transformation, we may be lucky to get back to that.  It's not like the
transformation could even increase the known alignment, so why bother,
and throw away debug info by rewriting the loop into the same code minus
debug?

> I think the builtins with alignment and calloc-style element count
> will be useful on its own.

Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
I'm not sure I want to go there.  I'd much rather recover the ctz of the
length, and use it in existing code.


I'd also prefer if the generic memset (and memcpy and memmove?) builtin
expanders dealt with non-constant lengths in the way I implemented.
That feels like the right spot for it.  That deprives us of gimple loop
optimizations in the inlined loop generated by the current patch, but if
we expand an unrolled loop with compares and offsets with small
constants, loop optimizations might not even be relevant.


FWIW, the patch I posted yesterday is broken, the regstrap test did not
even build libstdc++-v3 successfully.  I'm not sure whether to pursue it
further, or to reimplement it in the expander.  Suggestions are welcome.
Richard Biener Feb. 4, 2021, 8:37 a.m. UTC | #7
On Wed, Feb 3, 2021 at 4:11 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb  3, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > So I think we should try to match what __builtin_memcpy/memset
> > expansion would do here, taking advantage of extra alignment
> > and size knowledge.  In particular,
>
> >  a) if __builtin_memcpy/memset would use setmem/cpymem optabs
> >      see if we can have variants of memcpy/memset transferring alignment
> >      and size knowledge
>
> We could add more optional parameters to them to convey the length's
> known ctz.  However, the ctz can't be recovered reliably.  We can't even
> recover it after gimplifying the length within ldist!
>
> That said, my other patch already enables ctz calls to recover it, at
> least in libgcc risc-v tfmode cases, and it's possible it's readily
> available in other cases.  I'd rather leave that for someone dealing
> with the machine-specific patterns to figure out whether a separate
> argument would be useful.  RISC-V, which is what I'm dealing with,
> doesn't have much to offer as far as these patterns are concerned.
>
> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>
> Why would that be better than keeping the constant-length memset call,
> that would be turned into an unrolled loop during expand?

Well, because of the possibly lost ctz and alignment info.  There's also
the eventual benefit of eliding the source object (for memcpy) when the
variable accesses in the original loop or the address-taken for the
memcpy/memset are the only reasons it is not elided.  Of course that's then
again a general GIMPLE memcpy/memset inlining deficit (we only
"inline" it if it's possible to express the operation with one read and
one write).

> >  c) if expansion would emit a memset/memcpy call but we know
> >      alignment and have a low bound on niters emit a loop (like your patch does)
>
> > a) might be difficult but adding the builtin variants may pay off in any case.
>
> *nod*
>
> > The patch itself could benefit from one or two helpers we already
> > have, first of all there's create_empty_loop_on_edge (so you don't
> > need the loop fixup)
>
> Uhh, thanks, but...  you realize nearly all of the gimple-building code
> is one and the same for the loop and for trailing count misalignment?

Sorry, the code lacked comments and so I didn't actually try deciphering
the code you generate ;)

> There doesn't seem to be a lot of benefit to using this primitive, aside
> from its dealing with creating the loop data structure which, again, I'd
> only want to do in the loop case.
>
> (I guess I should add more comments as to the inline expansion
>  strategy.  it's equivalent to:
>
>  i = len, ptr = base, blksz = 1 << alctz;
>  while (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>  blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>  blksz >>= 1; if (i >= blksz) { *(ub<blksz>*)ptr = val; i -= blksz; ptr += blksz; }
>  ... until blksz gets down to zero or to 1<<szctz

Oh, I see ... fancy.  Maybe a little too fancy ;)  I'd have emitted a loop
storing 1<<szctz pieces but then this would likely be simply the
original loop in most cases.  Which, for loop distribution, hints at
an alternative implementation of simply guessing whether a generated
memcpy/memset would be inlined by RTL expansion and in that
case _not_ generate a builtin but mark the partitioned loop for GIMPLE
complete unrolling.  That might not be optimal and relies on later
store-merging to form larger writes of course.

> > Note that for memmove if we know the dependence direction, we
> > can also emit a loop / unrolled code.
>
> *nod*, but the logic would have to be quite different, using bit tests,
> and odds are we won't know the direction and have to output a test and
> code for both possibilities, which would be quite unlikely to be
> beneficial.  Though the original code would quite likely make the
> direction visible; perhaps if the size is small enough that we would
> expand a memcpy inline, we should refrain from transforming the loop
> into a memmove call.
>
> In the case at hand, there's no benefit at all to these transformations:
> we start from a loop with the known alignment and a small loop count (0
> to 2 words copied), and if everything goes all right with the
> transformation, we may be lucky to get back to that.  It's not like the
> transformation could even increase the known alignment, so why bother,
> and throw away debug info by rewriting the loop into the same code minus
> debug?

The original motivation was really that esp. for small trip count loops
the target knows best how to implement them.  Now, that completely
fails of course in case the target doesn't implement any of this or
the generic code fails because we lost ctz and alignment info.

> > I think the builtins with alignment and calloc-style element count
> > will be useful on its own.
>
> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
> I'm not sure I want to go there.  I'd much rather recover the ctz of the
> length, and use it in existing code.

Yeah, but when we generate memcpy there might not be a way to
store the ctz info until RTL expansion where the magic should really happen ...

> I'd also prefer if the generic memset (and memcpy and memmove?) builtin
> expanders dealt with non-constant lengths in the way I implemented.

*nod*

> That feels like the right spot for it.  That deprives us of gimple loop
> optimizations in the inlined loop generated by the current patch, but if
> we expand an unrolled loop with compares and offsets with small
> constants, loop optimizations might not even be relevant.
>
>
> FWIW, the patch I posted yesterday is broken, the regstrap test did not
> even build libstdc++-v3 successfully.  I'm not sure whether to pursue it
> further, or to reimplement it in the expander.  Suggestions are welcome.

So I think we agree that ideally RTL expansion should be the place to
recover the loop.  There's left to be see how to most easily transfer
alignment and ctz info there.  I guess alternatively one could indeed
make the point that _not_ doing the memcpy/memset replacement
is likely less surprising than trying to do fancy inlining in loop distribution
as compared to RTL expansion.

So I'd say go for improving RTL expansion.  If that for some reason
fails then anticipating the RTL expansion failure cases in loop
distribution and not generating the builtin there (but the loop
as it would be distributed) is probably the best and simplest thing.

We might consider doing RTL-like *_BY_PIECES expansion of
builtins on GIMPLE as well, not at random places through fold_stmt,
but at a select point in the pass pipeline.

Thanks,
Richard.

> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
Alexandre Oliva Feb. 4, 2021, 10:17 p.m. UTC | #8
On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>> 
>> Why would that be better than keeping the constant-length memset call,
>> that would be turned into an unrolled loop during expand?

> Well, because of the possibly lost ctz and alignment info.

Funny you should mention that.  I got started with the expand-time
expansion yesterday, and found out that we're not using the alignment
information that is available.  Though the pointer is known to point to
an aligned object, we are going for 8-bit alignment for some reason.

The strategy I used there was to first check whether by_pieces would
expand inline a constant length near the max known length, then loop
over the bits in the variable length, expand in each iteration a
constant-length store-by-pieces for the fixed length corresponding to
that bit, and a test comparing the variable length with the fixed length
guarding the expansion of the store-by-pieces.  We may get larger code
this way (no loops), but only O(log(len)) compares.
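
To make that concrete, here's a rough C-level picture (just an
illustration of the shape, not the emitted RTL; the constant-size
memsets stand in for the store-by-pieces expansions) for a length of
at most 16 bytes that is known to be a multiple of 4:

  void
  sketch (char *p, unsigned len, char c)
  {
    unsigned rem = len;      /* 0 <= len <= 16, len % 4 == 0 */
    if (rem >= 16) { __builtin_memset (p, c, 16); p += 16; rem -= 16; }
    if (rem >= 8)  { __builtin_memset (p, c, 8);  p += 8;  rem -= 8;  }
    if (rem >= 4)  { __builtin_memset (p, c, 4);  p += 4;  rem -= 4;  }
    /* No 2- or 1-byte blocks: ctz(len) == 2 makes them unnecessary.  */
  }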

I've also fixed some bugs in the ldist expander, so now it bootstraps,
but with a few regressions in the testsuite, that I'm yet to look into.

>> Uhh, thanks, but...  you realize nearly all of the gimple-building code
>> is one and the same for the loop and for trailing count misalignment?

> Sorry, the code lacked comments and so I didn't actually try deciphering
> the code you generate ;)

Oh, come on, it was plainly obscure ;-D

Sorry for posting an early-draft before polishing it up.

> The original motivation was really that esp. for small trip count loops
> the target knows best how to implement them.  Now, that completely
> fails of course in case the target doesn't implement any of this or
> the generic code fails because we lost ctz and alignment info.

In our case, generic code fails because it won't handle variable-sized
clear-by-pieces.  But then, I found out, when it's fixed-size, it also
makes the code worse, because it seems to expand to byte stores even
when the store-to object is known to have wider alignment:

union u {
  long long i;
  char c[8];
} x[8];
void s(union u *p, int k) {
  for (int i = k ? 0 : 3; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
      p[i].c[j] = 0;
    } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
  }
}

>> > I think the builtins with alignment and calloc-style element count
>> > will be useful on its own.
>> 
>> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
>> I'm not sure I want to go there.  I'd much rather recover the ctz of the
>> length, and use it in existing code.

> Yeah, but when we generate memcpy there might not be a way to
> store the ctz info until RTL expansion where the magic should really happen ...

True.  It can be recovered without much difficulty in the cases I've
looked at, but it could be lost in others.

> So I'd say go for improving RTL expansion.

'k, thanks
Jim Wilson Feb. 5, 2021, 12:13 a.m. UTC | #9
On Wed, Jan 27, 2021 at 4:40 AM Alexandre Oliva <oliva@adacore.com> wrote:

> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>

FYI we have a bug report for this for a coremark regression which sounds
like the same problem.
     https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Jim
Richard Biener Feb. 5, 2021, 8:02 a.m. UTC | #10
On Thu, Feb 4, 2021 at 11:18 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>
> >> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>
> >> Why would that be better than keeping the constant-length memset call,
> >> that would be turned into an unrolled loop during expand?
>
> > Well, because of the possibly lost ctz and alignment info.
>
> Funny you should mention that.  I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available.  Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.
>
> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces.  We may get larger code
> this way (no loops), but only O(log(len)) compares.
>
> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.
>
> >> Uhh, thanks, but...  you realize nearly all of the gimple-building code
> >> is one and the same for the loop and for trailing count misalignment?
>
> > Sorry, the code lacked comments and so I didn't actually try deciphering
> > the code you generate ;)
>
> Oh, come on, it was plainly obscure ;-D
>
> Sorry for posting an early-draft before polishing it up.
>
> > The original motivation was really that esp. for small trip count loops
> > the target knows best how to implement them.  Now, that completely
> > fails of course in case the target doesn't implement any of this or
> > the generic code fails because we lost ctz and alignment info.
>
> In our case, generic code fails because it won't handle variable-sized
> clear-by-pieces.  But then, I found out, when it's fixed-size, it also
> makes the code worse, because it seems to expand to byte stores even
> when the store-to object is known to have wider alignment:
>
> union u {
>   long long i;
>   char c[8];
> } x[8];
> void s(union u *p, int k) {
>   for (int i = k ? 0 : 3; i < 8; i++) {
>     for (int j = 0; j < 8; j++) {
>       p[i].c[j] = 0;
>     } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
>   }
> }

On x86_64 I see two DImode stores generated, but that appears to be
done via setmem.

> >> > I think the builtins with alignment and calloc-style element count
> >> > will be useful on its own.
> >>
> >> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
> >> I'm not sure I want to go there.  I'd much rather recover the ctz of the
> >> length, and use it in existing code.
>
> > Yeah, but when we generate memcpy there might not be a way to
> > store the ctz info until RTL expansion where the magic should really happen ...
>
> True.  It can be recovered without much difficulty in the cases I've
> looked at, but it could be lost in others.
>
> > So I'd say go for improving RTL expansion.
>
> 'k, thanks
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
Alexandre Oliva Feb. 11, 2021, 10:11 a.m. UTC | #11
On Feb  4, 2021, Jim Wilson <jimw@sifive.com> wrote:

> FYI we have a bug report for this for a coremark regression which sounds
> like the same problem.
>      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Indeed, thanks!
Alexandre Oliva Feb. 11, 2021, 10:19 a.m. UTC | #12
On Feb  4, 2021, Alexandre Oliva <oliva@adacore.com> wrote:

> On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
>>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>>> 
>>> Why would that be better than keeping the constant-length memset call,
>>> that would be turned into an unrolled loop during expand?

>> Well, because of the possibly lost ctz and alignment info.

> Funny you should mention that.  I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available.  Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.

> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces.  We may get larger code
> this way (no loops), but only O(log(len)) compares.

> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.

A few more fixes later, this seems to work.

It encompasses the patch to recover tree_ctz from a mult_expr def, adds
code to set up the alignment data for the ldist-generated memset dest,
and then expands memset as described above when the length is not
constant, setmem is not available, but the by-pieces machinery would
still be used for nearby constants.
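
As a reduced illustration of that first piece (a made-up example, not
taken from the patch): given

  void
  zero_words (long long *p, unsigned n)
  {
    for (unsigned i = 0; i < n; i++)
      p[i] = 0;
  }

ldist turns the loop into a memset whose length is an SSA name defined
as n * sizeof (long long), i.e. a mult_expr by a power-of-two constant,
so tree_ctz can now report at least 3 trailing zero bits and the
expander can skip the sub-word tail blocks.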

How does this look?


[PR94092] introduce try store by multiple pieces

From: Alexandre Oliva <oliva@adacore.com>

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.

It also propagates the alignment of the memset dest, introduced by
loop-distribution, to the SSA pointer used to hold it, so that it is
not discarded right away, and may be recovered by the memset builtin
expander.

Finally, it computes the ctz of the length, and uses it to omit blocks
for stores of small byte counts when we're storing whole words.
tree_ctz is improved so as to look at the length DEF stmt, and zero
out the least-significant bits if it's a multiply by a power of two.


for  gcc/ChangeLog

	PR tree-optimization/94092
	* builtins.c (try_store_by_multiple_pieces): New.
	(expand_builtin_memset_args): Use it.  If target_char_cast
	fails, proceed as for non-constant val.  Pass len's ctz to...
	* expr.c (clear_storage_hints): ... this.  Try store by
	multiple pieces after setmem.
	(clear_storage): Adjust.
	* expr.h (clear_storage_hints): Likewise.
	(try_store_by_multiple_pieces): Declare.
	* tree-loop-distribution.c: Include builtins.h.
	(generate_memset_builtin): Propagate dst_base alignment to mem.
	* tree-ssanames.c (get_nonzero_bits): Zero out low bits of
	integral types, when a MULT_EXPR INTEGER_CST operand ensures
	the result will be a multiple of a power of two.
---
 gcc/builtins.c               |  113 +++++++++++++++++++++++++++++++++++++++---
 gcc/expr.c                   |    9 +++
 gcc/expr.h                   |   12 ++++
 gcc/tree-loop-distribution.c |    9 +++
 gcc/tree-ssanames.c          |   23 ++++++++-
 5 files changed, 154 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 0aed008687ccc..44f3af92536a5 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN boundary.  LEN is assumed to be a multiple of
+   1<<CTZ_LEN, and between min_len and max_len.
+
+   The strategy is to issue one store_by_pieces for each power of two,
+   for most to least significant, guarded by a test on whether there are
+   at least that many bytes to copy.  */
+
+bool
+try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
+			      unsigned HOST_WIDE_INT min_len,
+			      unsigned HOST_WIDE_INT max_len,
+			      rtx val, char valc, unsigned int align)
+{
+  int max_bits = floor_log2 (max_len);
+  int min_bits = floor_log2 (min_len);
+
+  if (val)
+    valc = 1;
+
+  if (!can_store_by_pieces ((HOST_WIDE_INT_1U << max_bits) * 2
+			    - (HOST_WIDE_INT_1U << ctz_len),
+			    builtin_memset_read_str, &valc, align, true))
+    return false;
+
+  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
+  void *constfundata;
+  if (val)
+    {
+      constfun = builtin_memset_gen_str;
+      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
+				      val);
+    }
+  else
+    {
+      constfun = builtin_memset_read_str;
+      constfundata = &valc;
+    }
+
+  /* Bits above SHR_BITS don't need to be tested.  */
+  int shr_bits = (max_bits != min_bits
+		  ? max_bits
+		  : floor_log2 (max_len ^ min_len));
+
+  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
+  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
+  set_mem_align (to, align);
+
+  for (int i = max_bits; i >= ctz_len; i--)
+    {
+      unsigned HOST_WIDE_INT blksize = HOST_WIDE_INT_1U << i;
+      rtx_code_label *label = NULL;
+
+      if (i <= shr_bits)
+	{
+	  label = gen_label_rtx ();
+	  emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
+				   ptr_mode, 1, label,
+				   profile_probability::even ());
+	}
+      else if ((max_len & blksize) == 0)
+	continue;
+
+      if (align > blksize * BITS_PER_UNIT)
+	{
+	  align = blksize * BITS_PER_UNIT;
+	  set_mem_align (to, align);
+	}
+
+      to = store_by_pieces (to, blksize,
+			    constfun, constfundata,
+			    align, true, RETURN_END);
+
+      if (label)
+	{
+	  emit_move_insn (ptr, XEXP (to, 0));
+	  XEXP (to, 0) = ptr;
+
+	  clear_mem_offset (to);
+	  clear_mem_size (to);
+
+	  emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
+
+	  emit_label (label);
+	}
+    }
+
+  return true;
+}
+
 /* Helper function to do the actual work for expand_builtin_memset.  The
    arguments to the builtin_memset call DEST, VAL, and LEN are broken out
    so that this can also be called without constructing an actual CALL_EXPR.
@@ -6654,7 +6745,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
   dest_mem = get_memory_rtx (dest, len);
   val_mode = TYPE_MODE (unsigned_char_type_node);
 
-  if (TREE_CODE (val) != INTEGER_CST)
+  if (TREE_CODE (val) != INTEGER_CST
+      || target_char_cast (val, &c))
     {
       rtx val_rtx;
 
@@ -6678,7 +6770,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 val_rtx, 0,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6686,9 +6783,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
       return dest_mem;
     }
 
-  if (target_char_cast (val, &c))
-    goto do_libcall;
-
   if (c)
     {
       if (tree_fits_uhwi_p (len)
@@ -6702,7 +6796,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 					gen_int_mode (c, val_mode),
 					dest_align, expected_align,
 					expected_size, min_size, max_size,
-					probable_max_size))
+					probable_max_size)
+	       && !try_store_by_multiple_pieces (dest_mem, len_rtx,
+						 tree_ctz (len),
+						 min_size, max_size,
+						 NULL_RTX, c,
+						 dest_align))
 	goto do_libcall;
 
       dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
@@ -6716,7 +6815,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
 				   ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
 				   expected_align, expected_size,
 				   min_size, max_size,
-				   probable_max_size);
+				   probable_max_size, tree_ctz (len));
 
   if (dest_addr == 0)
     {
diff --git a/gcc/expr.c b/gcc/expr.c
index 04ef5ad114d06..b212e9ac575a7 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 		     unsigned int expected_align, HOST_WIDE_INT expected_size,
 		     unsigned HOST_WIDE_INT min_size,
 		     unsigned HOST_WIDE_INT max_size,
-		     unsigned HOST_WIDE_INT probable_max_size)
+		     unsigned HOST_WIDE_INT probable_max_size,
+		     unsigned ctz_size)
 {
   machine_mode mode = GET_MODE (object);
   unsigned int align;
@@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
 				   expected_align, expected_size,
 				   min_size, max_size, probable_max_size))
     ;
+  else if (try_store_by_multiple_pieces (object, size, ctz_size,
+					 min_size, max_size,
+					 NULL_RTX, 0, align))
+    ;
   else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
     return set_storage_via_libcall (object, size, const0_rtx,
 				    method == BLOCK_OP_TAILCALL);
@@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
     min = max = UINTVAL (size);
   else
     max = GET_MODE_MASK (GET_MODE (size));
-  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
+  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
 }
 
 
diff --git a/gcc/expr.h b/gcc/expr.h
index 1f0177a4cfa5d..6ff2384df63f8 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
 			        unsigned int, HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
 				unsigned HOST_WIDE_INT,
-				unsigned HOST_WIDE_INT);
+				unsigned HOST_WIDE_INT,
+				unsigned);
 /* The same, but always output an library call.  */
 extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
 
@@ -224,6 +225,15 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
 extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
 			    void *, unsigned int, bool, memop_ret);
 
+/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
+   store_by_pieces within conditionals so as to handle variable LEN efficiently,
+   storing VAL, if non-NULL_RTX, or valc instead.  */
+extern bool try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
+					  unsigned HOST_WIDE_INT min_len,
+					  unsigned HOST_WIDE_INT max_len,
+					  rtx val, char valc,
+					  unsigned int align);
+
 /* Emit insns to set X from Y.  */
 extern rtx_insn *emit_move_insn (rtx, rtx);
 extern rtx_insn *gen_move_insn (rtx, rtx);
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..a64b89037a724 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "tree-eh.h"
 #include "gimple-fold.h"
+#include "builtins.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -1155,6 +1156,14 @@ generate_memset_builtin (class loop *loop, partition *partition)
   mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
 
+  if (TREE_CODE (mem) == SSA_NAME)
+    if (ptr_info_def *pi = get_ptr_info (mem))
+      {
+	unsigned al = get_pointer_alignment (builtin->dst_base);
+	if (al > pi->align || pi->misalign)
+	  set_ptr_info_alignment (pi, al, 0);
+      }
+
   /* This exactly matches the pattern recognition in classify_partition.  */
   val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
   /* Handle constants like 0x15151515 and similarly
diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
index 51a26d2fce1c2..c4b5bf2a4999a 100644
--- a/gcc/tree-ssanames.c
+++ b/gcc/tree-ssanames.c
@@ -546,10 +546,29 @@ get_nonzero_bits (const_tree name)
     }
 
   range_info_def *ri = SSA_NAME_RANGE_INFO (name);
+  wide_int ret;
   if (!ri)
-    return wi::shwi (-1, precision);
+    ret = wi::shwi (-1, precision);
+  else
+    ret = ri->get_nonzero_bits ();
+
+  /* If NAME is defined as a multiple of a constant C, we know the ctz(C) low
+     bits are zero.  ??? Should we handle LSHIFT_EXPR too?  Non-constants,
+     e.g. the minimum shift count, and ctz from both MULT_EXPR operands?  That
+     could make for deep recursion.  */
+  if (INTEGRAL_TYPE_P (TREE_TYPE (name))
+      && SSA_NAME_DEF_STMT (name)
+      && is_gimple_assign (SSA_NAME_DEF_STMT (name))
+      && gimple_assign_rhs_code (SSA_NAME_DEF_STMT (name)) == MULT_EXPR
+      && TREE_CODE (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name))) == INTEGER_CST)
+    {
+      unsigned HOST_WIDE_INT bits
+	= tree_ctz (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name)));
+      wide_int mask = wi::shwi (-1, precision) << bits;
+      ret &= mask;
+    }
 
-  return ri->get_nonzero_bits ();
+  return ret;
 }
 
 /* Return TRUE is OP, an SSA_NAME has a range of values [0..1], false
Alexandre Oliva Feb. 11, 2021, 12:14 p.m. UTC | #13
On Feb 11, 2021, Alexandre Oliva <oliva@adacore.com> wrote:

> How does this look?


> for  gcc/ChangeLog

> 	PR tree-optimization/94092
> 	* builtins.c (try_store_by_multiple_pieces): New.
> 	(expand_builtin_memset_args): Use it.  If target_char_cast
> 	fails, proceed as for non-constant val.  Pass len's ctz to...
> 	* expr.c (clear_storage_hints): ... this.  Try store by
> 	multiple pieces after setmem.
> 	(clear_storage): Adjust.
> 	* expr.h (clear_storage_hints): Likewise.
> 	(try_store_by_multiple_pieces): Declare.
> 	* tree-loop-distribution.c: Include builtins.h.
> 	(generate_memset_builtin): Propagate dst_base alignment to mem.
> 	* tree-ssanames.c (get_nonzero_bits): Zero out low bits of
> 	integral types, when a MULT_EXPR INTEGER_CST operand ensures
> 	the result will be a multiple of a power of two.

I forgot to mention this passed regstrap on x86_64-linux-gnu, as well as
some cross testing of riscv32-elf.

I've also regstrapped it on x86_64-linux-gnu along with a patch for
testing purposes, that tried try_store_by_multiple_pieces before setmem
in all 3 locations where they are called, which gives me some confidence
that the implementation is reasonably robust.


Is this ok to install?  (if not right now, perhaps in stage1)
Richard Biener Feb. 12, 2021, 11:34 a.m. UTC | #14
On Thu, Feb 11, 2021 at 11:19 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb  4, 2021, Alexandre Oliva <oliva@adacore.com> wrote:
>
> > On Feb  4, 2021, Richard Biener <richard.guenther@gmail.com> wrote:
> >>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>>
> >>> Why would that be better than keeping the constant-length memset call,
> >>> that would be turned into an unrolled loop during expand?
>
> >> Well, because of the possibly lost ctz and alignment info.
>
> > Funny you should mention that.  I got started with the expand-time
> > expansion yesterday, and found out that we're not using the alignment
> > information that is available.  Though the pointer is known to point to
> > an aligned object, we are going for 8-bit alignment for some reason.
>
> > The strategy I used there was to first check whether by_pieces would
> > expand inline a constant length near the max known length, then loop
> > over the bits in the variable length, expand in each iteration a
> > constant-length store-by-pieces for the fixed length corresponding to
> > that bit, and a test comparing the variable length with the fixed length
> > guarding the expansion of the store-by-pieces.  We may get larger code
> > this way (no loops), but only O(log(len)) compares.
>
> > I've also fixed some bugs in the ldist expander, so now it bootstraps,
> > but with a few regressions in the testsuite, that I'm yet to look into.
>
> A few more fixes later, this seems to work.
>
> It encompasses the patch to recover tree_ctz from a mult_expr def, it
> adds code to set up the alignment data for the ldist-generated memset
> dest, and then it expands memset as described above if length is not
> constant, setmem is not available, but the by-pieces machinery would
> still be used for nearby constants.
>
> How does this look?

Overall it looks good - I can't really review the RTL code generation
parts because I'm not too familiar with the usual pitfalls there.

Some comments below.

>
> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva <oliva@adacore.com>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also propagates the alignment of the memset dest, introduced by
> loop-distribution, to the SSA pointer used to hold it, so that it is
> not discarded right away, and may be recovered by the memset builtin
> expander.
>
> Finally, it computes the ctz of the length, and uses it to omit blocks
> for stores of small byte counts when we're storing whole words.
> tree_ctz is improved so as to look at the length DEF stmt, and zero
> out the least-significant bits if it's a multiply by a power of two.
>
>
> for  gcc/ChangeLog
>
>         PR tree-optimization/94092
>         * builtins.c (try_store_by_multiple_pieces): New.
>         (expand_builtin_memset_args): Use it.  If target_char_cast
>         fails, proceed as for non-constant val.  Pass len's ctz to...
>         * expr.c (clear_storage_hints): ... this.  Try store by
>         multiple pieces after setmem.
>         (clear_storage): Adjust.
>         * expr.h (clear_storage_hints): Likewise.
>         (try_store_by_multiple_pieces): Declare.
>         * tree-loop-distribution.c: Include builtins.h.
>         (generate_memset_builtin): Propagate dst_base alignment to mem.
>         * tree-ssanames.c (get_nonzero_bits): Zero out low bits of
>         integral types, when a MULT_EXPR INTEGER_CST operand ensures
>         the result will be a multiple of a power of two.
> ---
>  gcc/builtins.c               |  113 +++++++++++++++++++++++++++++++++++++++---
>  gcc/expr.c                   |    9 +++
>  gcc/expr.h                   |   12 ++++
>  gcc/tree-loop-distribution.c |    9 +++
>  gcc/tree-ssanames.c          |   23 ++++++++-
>  5 files changed, 154 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..44f3af92536a5 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>    return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
> +   aligned at an ALIGN boundary.  LEN is assumed to be a multiple of
> +   1<<CTZ_LEN, and between min_len and max_len.
> +
> +   The strategy is to issue one store_by_pieces for each power of two,
> +   for most to least significant, guarded by a test on whether there are
> +   at least that many bytes to copy.  */
> +
> +bool
> +try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
> +                             unsigned HOST_WIDE_INT min_len,
> +                             unsigned HOST_WIDE_INT max_len,
> +                             rtx val, char valc, unsigned int align)
> +{
> +  int max_bits = floor_log2 (max_len);
> +  int min_bits = floor_log2 (min_len);
> +
> +  if (val)
> +    valc = 1;
> +
> +  if (!can_store_by_pieces ((HOST_WIDE_INT_1U << max_bits) * 2
> +                           - (HOST_WIDE_INT_1U << ctz_len),
> +                           builtin_memset_read_str, &valc, align, true))
> +    return false;
> +
> +  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
> +  void *constfundata;
> +  if (val)
> +    {
> +      constfun = builtin_memset_gen_str;
> +      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
> +                                     val);
> +    }
> +  else
> +    {
> +      constfun = builtin_memset_read_str;
> +      constfundata = &valc;
> +    }
> +
> +  /* Bits above SHR_BITS don't need to be tested.  */
> +  int shr_bits = (max_bits != min_bits
> +                 ? max_bits
> +                 : floor_log2 (max_len ^ min_len));
> +
> +  rtx ptr = copy_addr_to_reg (convert_to_mode (ptr_mode, XEXP (to, 0), 0));
> +  rtx rem = copy_to_mode_reg (ptr_mode, convert_to_mode (ptr_mode, len, 0));
> +  set_mem_align (to, align);
> +
> +  for (int i = max_bits; i >= ctz_len; i--)
> +    {
> +      unsigned HOST_WIDE_INT blksize = HOST_WIDE_INT_1U << i;
> +      rtx_code_label *label = NULL;
> +
> +      if (i <= shr_bits)
> +       {
> +         label = gen_label_rtx ();
> +         emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
> +                                  ptr_mode, 1, label,
> +                                  profile_probability::even ());
> +       }
> +      else if ((max_len & blksize) == 0)
> +       continue;
> +
> +      if (align > blksize * BITS_PER_UNIT)
> +       {
> +         align = blksize * BITS_PER_UNIT;
> +         set_mem_align (to, align);
> +       }
> +
> +      to = store_by_pieces (to, blksize,
> +                           constfun, constfundata,
> +                           align, true, RETURN_END);
> +
> +      if (label)
> +       {
> +         emit_move_insn (ptr, XEXP (to, 0));
> +         XEXP (to, 0) = ptr;
> +
> +         clear_mem_offset (to);
> +         clear_mem_size (to);
> +
> +         emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));
> +
> +         emit_label (label);
> +       }
> +    }
> +
> +  return true;
> +}
> +
>  /* Helper function to do the actual work for expand_builtin_memset.  The
>     arguments to the builtin_memset call DEST, VAL, and LEN are broken out
>     so that this can also be called without constructing an actual CALL_EXPR.
> @@ -6654,7 +6745,8 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>    dest_mem = get_memory_rtx (dest, len);
>    val_mode = TYPE_MODE (unsigned_char_type_node);
>
> -  if (TREE_CODE (val) != INTEGER_CST)
> +  if (TREE_CODE (val) != INTEGER_CST
> +      || target_char_cast (val, &c))
>      {
>        rtx val_rtx;
>
> @@ -6678,7 +6770,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        else if (!set_storage_via_setmem (dest_mem, len_rtx, val_rtx,
>                                         dest_align, expected_align,
>                                         expected_size, min_size, max_size,
> -                                       probable_max_size))
> +                                       probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                val_rtx, 0,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6686,9 +6783,6 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>        return dest_mem;
>      }
>
> -  if (target_char_cast (val, &c))
> -    goto do_libcall;
> -
>    if (c)
>      {
>        if (tree_fits_uhwi_p (len)
> @@ -6702,7 +6796,12 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                         gen_int_mode (c, val_mode),
>                                         dest_align, expected_align,
>                                         expected_size, min_size, max_size,
> -                                       probable_max_size))
> +                                       probable_max_size)
> +              && !try_store_by_multiple_pieces (dest_mem, len_rtx,
> +                                                tree_ctz (len),
> +                                                min_size, max_size,
> +                                                NULL_RTX, c,
> +                                                dest_align))
>         goto do_libcall;
>
>        dest_mem = force_operand (XEXP (dest_mem, 0), NULL_RTX);
> @@ -6716,7 +6815,7 @@ expand_builtin_memset_args (tree dest, tree val, tree len,
>                                    ? BLOCK_OP_TAILCALL : BLOCK_OP_NORMAL,
>                                    expected_align, expected_size,
>                                    min_size, max_size,
> -                                  probable_max_size);
> +                                  probable_max_size, tree_ctz (len));
>
>    if (dest_addr == 0)
>      {
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 04ef5ad114d06..b212e9ac575a7 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -3056,7 +3056,8 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                      unsigned int expected_align, HOST_WIDE_INT expected_size,
>                      unsigned HOST_WIDE_INT min_size,
>                      unsigned HOST_WIDE_INT max_size,
> -                    unsigned HOST_WIDE_INT probable_max_size)
> +                    unsigned HOST_WIDE_INT probable_max_size,
> +                    unsigned ctz_size)
>  {
>    machine_mode mode = GET_MODE (object);
>    unsigned int align;
> @@ -3103,6 +3104,10 @@ clear_storage_hints (rtx object, rtx size, enum block_op_methods method,
>                                    expected_align, expected_size,
>                                    min_size, max_size, probable_max_size))
>      ;
> +  else if (try_store_by_multiple_pieces (object, size, ctz_size,
> +                                        min_size, max_size,
> +                                        NULL_RTX, 0, align))
> +    ;
>    else if (ADDR_SPACE_GENERIC_P (MEM_ADDR_SPACE (object)))
>      return set_storage_via_libcall (object, size, const0_rtx,
>                                     method == BLOCK_OP_TAILCALL);
> @@ -3120,7 +3125,7 @@ clear_storage (rtx object, rtx size, enum block_op_methods method)
>      min = max = UINTVAL (size);
>    else
>      max = GET_MODE_MASK (GET_MODE (size));
> -  return clear_storage_hints (object, size, method, 0, -1, min, max, max);
> +  return clear_storage_hints (object, size, method, 0, -1, min, max, max, 0);
>  }
>
>
> diff --git a/gcc/expr.h b/gcc/expr.h
> index 1f0177a4cfa5d..6ff2384df63f8 100644
> --- a/gcc/expr.h
> +++ b/gcc/expr.h
> @@ -193,7 +193,8 @@ extern rtx clear_storage_hints (rtx, rtx, enum block_op_methods,
>                                 unsigned int, HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
>                                 unsigned HOST_WIDE_INT,
> -                               unsigned HOST_WIDE_INT);
> +                               unsigned HOST_WIDE_INT,
> +                               unsigned);
>  /* The same, but always output an library call.  */
>  extern rtx set_storage_via_libcall (rtx, rtx, rtx, bool = false);
>
> @@ -224,6 +225,15 @@ extern int can_store_by_pieces (unsigned HOST_WIDE_INT,
>  extern rtx store_by_pieces (rtx, unsigned HOST_WIDE_INT, by_pieces_constfn,
>                             void *, unsigned int, bool, memop_ret);
>
> +/* If can_store_by_pieces passes for worst-case values near MAX_LEN, call
> +   store_by_pieces within conditionals so as to handle variable LEN efficiently,
> +   storing VAL, if non-NULL_RTX, or valc instead.  */
> +extern bool try_store_by_multiple_pieces (rtx to, rtx len, int ctz_len,
> +                                         unsigned HOST_WIDE_INT min_len,
> +                                         unsigned HOST_WIDE_INT max_len,
> +                                         rtx val, char valc,
> +                                         unsigned int align);
> +
>  /* Emit insns to set X from Y.  */
>  extern rtx_insn *emit_move_insn (rtx, rtx);
>  extern rtx_insn *gen_move_insn (rtx, rtx);
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..a64b89037a724 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "builtins.h"
>
>
>  #define MAX_DATAREFS_NUM \
> @@ -1155,6 +1156,14 @@ generate_memset_builtin (class loop *loop, partition *partition)
>    mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
>                                   false, GSI_CONTINUE_LINKING);
>
> +  if (TREE_CODE (mem) == SSA_NAME)
> +    if (ptr_info_def *pi = get_ptr_info (mem))
> +      {
> +       unsigned al = get_pointer_alignment (builtin->dst_base);
> +       if (al > pi->align || pi->misalign)

We still might prefer pi->align == 64 and pi->misalign == 32 over al == 16,
so maybe factor that in, too.

Also, what's still lost is the alignment constraint the original access has
(and thus the base pointer must have, too).  Maybe store this info in
loop distribution's builtin_info and compute it from the data reference,
for example via get_object_alignment (DR_REF (dr)).
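
A minimal sketch of the kind of combination I mean (illustrative only,
placed as in the hunk above; the bit fiddling just computes what each
(align, misalign) pair effectively guarantees, in bytes):

  unsigned al = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
  /* What the existing (align, misalign) pair already guarantees.  */
  unsigned old = pi->misalign ? (pi->misalign & -pi->misalign) : pi->align;
  if (al > old)
    set_ptr_info_alignment (pi, al, 0);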

> +         set_ptr_info_alignment (pi, al, 0);
> +      }
> +
>    /* This exactly matches the pattern recognition in classify_partition.  */
>    val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
>    /* Handle constants like 0x15151515 and similarly
> diff --git a/gcc/tree-ssanames.c b/gcc/tree-ssanames.c
> index 51a26d2fce1c2..c4b5bf2a4999a 100644
> --- a/gcc/tree-ssanames.c
> +++ b/gcc/tree-ssanames.c
> @@ -546,10 +546,29 @@ get_nonzero_bits (const_tree name)
>      }
>
>    range_info_def *ri = SSA_NAME_RANGE_INFO (name);
> +  wide_int ret;
>    if (!ri)
> -    return wi::shwi (-1, precision);
> +    ret = wi::shwi (-1, precision);
> +  else
> +    ret = ri->get_nonzero_bits ();
> +
> +  /* If NAME is defined as a multiple of a constant C, we know the ctz(C) low
> +     bits are zero.  ??? Should we handle LSHIFT_EXPR too?  Non-constants,
> +     e.g. the minimum shift count, and ctz from both MULT_EXPR operands?  That
> +     could make for deep recursion.  */
> +  if (INTEGRAL_TYPE_P (TREE_TYPE (name))
> +      && SSA_NAME_DEF_STMT (name)
> +      && is_gimple_assign (SSA_NAME_DEF_STMT (name))
> +      && gimple_assign_rhs_code (SSA_NAME_DEF_STMT (name)) == MULT_EXPR
> +      && TREE_CODE (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name))) == INTEGER_CST)
> +    {
> +      unsigned HOST_WIDE_INT bits
> +       = tree_ctz (gimple_assign_rhs2 (SSA_NAME_DEF_STMT (name)));
> +      wide_int mask = wi::shwi (-1, precision) << bits;
> +      ret &= mask;
> +    }
>
> -  return ri->get_nonzero_bits ();
> +  return ret;
>  }

So I wonder whether we should instead re-run CCP after loop opts which
computes nonzero bits as well instead of the above "hack".  Would
nonzero bits be ready to compute in the above way from loop distribution?
That is, you can do set_nonzero_bits just like you did
set_ptr_info_alignment ...
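
Roughly (a sketch only -- nb_bytes is assumed to be the SSA name holding
the byte length in generate_memset_builtin, and elt_bytes the constant
size of the store the partition was formed from):

  if (TREE_CODE (nb_bytes) == SSA_NAME)
    {
      unsigned prec = TYPE_PRECISION (TREE_TYPE (nb_bytes));
      int ctz = exact_log2 (elt_bytes);
      if (ctz > 0)
        {
          /* Same mask idiom as in the get_nonzero_bits hunk above:
             all bits set except the low CTZ ones.  */
          wide_int mask = wi::shwi (-1, prec) << ctz;
          set_nonzero_bits (nb_bytes, mask);
        }
    }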

Since CCP also performs copy propagation an obvious candidate would be
to replace the last pass_copy_prop with pass_ccp (with a comment noting
to compute up-to-date nonzero bits info).

Thanks,
Richard.

>  /* Return TRUE is OP, an SSA_NAME has a range of values [0..1], false
>
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
Alexandre Oliva Feb. 16, 2021, 4:56 a.m. UTC | #15
On Feb 12, 2021, Richard Biener <richard.guenther@gmail.com> wrote:

>> +  if (TREE_CODE (mem) == SSA_NAME)
>> +    if (ptr_info_def *pi = get_ptr_info (mem))
>> +      {
>> +       unsigned al = get_pointer_alignment (builtin->dst_base);
>> +       if (al > pi->align || pi->misalign)

> We still might prefer pi->align == 64 and pi->misalign == 32 over al == 16
> so maybe factor that in, too.

Ugh, apologies, I somehow posted an incorrect and outdated version of
the patch.  The improved (propagates both alignments) and fixed (divides
by BITS_PER_UNIT, fixing a regression in gfortran.dg/sms-2.f90) had
this alternate hunk as the only difference:

@@ -1155,6 +1156,16 @@ generate_memset_builtin (class loop *loop, partition *partition)
   mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
 				  false, GSI_CONTINUE_LINKING);
 
+  if (TREE_CODE (mem) == SSA_NAME)
+    if (ptr_info_def *pi = get_ptr_info (mem))
+      {
+	unsigned al;
+	unsigned HOST_WIDE_INT misal;
+	if (get_pointer_alignment_1 (builtin->dst_base, &al, &misal))
+	  set_ptr_info_alignment (pi, al / BITS_PER_UNIT,
+				  misal / BITS_PER_UNIT);
+      }
+
   /* This exactly matches the pattern recognition in classify_partition.  */
   val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
   /* Handle constants like 0x15151515 and similarly



> So I wonder whether we should instead re-run CCP after loop opts which
> computes nonzero bits as well instead of the above "hack".  Would
> nonzero bits be ready to compute in the above way from loop distribution?
> That is, you can do set_nonzero_bits just like you did
> set_ptr_info_alignment ...

> Since CCP also performs copy propagation an obvious candidate would be
> to replace the last pass_copy_prop with pass_ccp (with a comment noting
> to compute up-to-date nonzero bits info).

I'll look into these possibilities.
Alexandre Oliva Feb. 16, 2021, 10:47 a.m. UTC | #16
On Feb 16, 2021, Alexandre Oliva <oliva@adacore.com> wrote:

>> So I wonder whether we should instead re-run CCP after loop opts which
>> computes nonzero bits as well instead of the above "hack".

That works.  It takes care of both the dest alignment and the len ctz.

Explicitly masking out the len's trailing-zero bits from nonzero bits also works, but I
suppose the ccp pass does a better job, and it takes care of memcpy and
memmove as well.


I noticed that ccp releases CDI_DOMINATORS information.  It looks like
the last copy_prop pass, now replaced with ccp in my tree, was late
enough that it doesn't seem to matter, but at first I thought you meant
an earlier copy_prop, that runs before graphite, and dropping dominator
info at that point was a problem for e.g. cunroll.
Richard Biener Feb. 16, 2021, 12:11 p.m. UTC | #17
On Tue, Feb 16, 2021 at 11:48 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Feb 16, 2021, Alexandre Oliva <oliva@adacore.com> wrote:
>
> >> So I wonder whether we should instead re-run CCP after loop opts which
> >> computes nonzero bits as well instead of the above "hack".
>
> That works.  It takes care of both the dest alignment and the len ctz.
>
> Explicitly masking out the len tz from nonzero bits also works, but I
> suppose the ccp pass does a better job, and it takes care of memcpy and
> memmove as well.

It's certainly more general, yes.

> I noticed that ccp releases CDI_DOMINATORS information.  It looks like
> the last copy_prop pass, now replaced with ccp in my tree, was late
> enough that it doesn't seem to matter, but at first I thought you meant
> an earlier copy_prop, that runs before graphite, and dropping dominator
> info at that point was a problem for e.g. cunroll.

We indeed shouldn't release CDI_DOMINATORS info, but if it is a problem
elsewhere, then that pass fails to compute dominator info itself.  But yes,
in the loop pipeline we expect dominator info to be present (but the fast
query can disable itself, leading to issues if doms are not recomputed).
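
(For reference, one usual pattern -- just the standard dominance API,
sketched here -- for a pass in that part of the pipeline is

  if (!dom_info_available_p (CDI_DOMINATORS))
    calculate_dominance_info (CDI_DOMINATORS);

and then not to free_dominance_info at the end if later loop passes
still rely on it.)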

Richard.

> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
diff mbox series

Patch

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..b5198652817ee 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2848,6 +2848,52 @@  fuse_memset_builtins (vec<struct partition *> *partitions)
     }
 }
 
+/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
+   call, and true if it wasn't, changing the PARTITION to PKIND_NORMAL.  */
+
+static bool
+maybe_normalize_partition (class loop *loop, struct partition *partition)
+{
+  unsigned HOST_WIDE_INT ratio;
+
+  switch (partition->kind)
+    {
+    case PKIND_NORMAL:
+    case PKIND_PARTIAL_MEMSET:
+      return false;
+
+    case PKIND_MEMSET:
+      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
+					     (partition->builtin->dst_dr))))
+	ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+      else
+	ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    case PKIND_MEMCPY:
+    case PKIND_MEMMOVE:
+      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+    return false;
+
+  wide_int minit, maxit;
+  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
+  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
+    {
+      partition->kind = PKIND_NORMAL;
+      return true;
+    }
+
+  return false;
+}
+
 void
 loop_distribution::finalize_partitions (class loop *loop,
 					vec<struct partition *> *partitions,
@@ -3087,6 +3133,14 @@  loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
     }
 
   finalize_partitions (loop, &partitions, &alias_ddrs);
+  {
+    bool any_changes_p = false;
+    for (i = 0; partitions.iterate (i, &partition); ++i)
+      if (maybe_normalize_partition (loop, partition))
+	any_changes_p = true;
+    if (any_changes_p)
+      finalize_partitions (loop, &partitions, &alias_ddrs);
+  }
 
   /* If there is a reduction in all partitions make sure the last one
      is not classified for builtin code generation.  */