[1/2] middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]

Message ID patch-16909-tamar@arm.com
State New
Series [1/2] middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]

Commit Message

Tamar Christina Feb. 9, 2023, 5:16 p.m. UTC
Hi All,

As discussed in the ticket, this replaces the approach for optimizing the
div-by-bitmask operation, moving it from a target hook to new optabs
implemented through add_highpart.

To use this we need to check whether the current precision has enough bits
to do the operation without any of the additions overflowing.

We use range information to determine this and only do the operation if we're
sure an overflow won't occur.
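
To make the transform concrete, here is a scalar sketch (illustrative
only, not part of the patch) of dividing a 16-bit unsigned value by
0xff, where addh models the unsigned add-highpart as the high half of
the sum:

  #include <stdint.h>
  #include <assert.h>

  /* Illustrative model of unsigned add-highpart on 16-bit values: the
     high half of the sum.  Only exact when x + y does not overflow
     16 bits, which is what the range check guarantees.  */
  static uint16_t
  addh (uint16_t x, uint16_t y)
  {
    return (uint16_t) ((x + y) >> 8);
  }

  int
  main (void)
  {
    /* x + 257 must not overflow 16 bits, mirroring the range check.  */
    for (uint32_t x = 0; x + 257 <= 0xffff; x++)
      assert (addh (x, addh (x, 257)) == x / 0xff);
    return 0;
  }

This is the same pair of ADDH operations the new pattern emits.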

Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
	* doc/tm.texi.in: Likewise.
	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
	* expmed.cc (expand_divmod): Likewise.
	* expmed.h (expand_divmod): Likewise.
	* expr.cc (force_operand, expand_expr_divmod): Likewise.
	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
	* internal-fn.def (ADDH): New.
	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
	* doc/md.texi: Document them.
	* doc/rtl.texi: Likewise.
	* target.def (can_special_div_by_const): Remove.
	* target.h: Remove tree-core.h include.
	* targhooks.cc (default_can_special_div_by_const): Remove.
	* targhooks.h (default_can_special_div_by_const): Remove.
	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
	implement new optab recognition based on range.
	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.

gcc/testsuite/ChangeLog:

	PR target/108583
	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
	* gcc.dg/vect/vect-div-bitmask-5.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
 Similar, but the multiplication is unsigned.  This may be represented
 in RTL using an @code{umul_highpart} RTX expression.
 
+@cindex @code{sadd@var{m}3_highpart} instruction pattern
+@item @samp{sadd@var{m}3_highpart}
+Perform a signed addition of operands 1 and 2, which have mode
+@var{m}, and store the most significant half of the product in operand 0.
+The least significant half of the product is discarded.  This may be
+represented in RTL using a @code{sadd_highpart} RTX expression.
+
+@cindex @code{uadd@var{m}3_highpart} instruction pattern
+@item @samp{uadd@var{m}3_highpart}
+Similar, but the addition is unsigned.  This may be represented
+in RTL using an @code{uadd_highpart} RTX expression.
+
 @cindex @code{madd@var{m}@var{n}4} instruction pattern
 @item @samp{madd@var{m}@var{n}4}
 Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
 of a signed multiplication, @code{umul_highpart} returns the high part
 of an unsigned multiplication.
 
+@findex sadd_highpart
+@findex uadd_highpart
+@cindex high-part addition
+@cindex addition high part
+@item (sadd_highpart:@var{m} @var{x} @var{y})
+@itemx (uadd_highpart:@var{m} @var{x} @var{y})
+Represents the high-part addition of @var{x} and @var{y} carried
+out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
+of a signed addition, @code{uadd_highpart} returns the high part
+of an unsigned addition.
+
 @findex fma
 @cindex fused multiply-add
 @item (fma:@var{m} @var{x} @var{y} @var{z})
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
-This hook is used to test whether the target has a special method of
-division of vectors of type @var{vectype} using the value @var{constant},
-and producing a vector of type @var{vectype}.  The division
-will then not be decomposed by the vectorizer and kept as a div.
-
-When the hook is being used to test whether the target supports a special
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
-is being used to emit a division, @var{in0} and @var{in1} are the source
-vectors of type @var{vecttype} and @var{output} is the destination vector of
-type @var{vectype}.
-
-Return true if the operation is possible, emitting instructions for it
-if rtxes are provided and updating @var{output}.
-@end deftypefn
-
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
 This hook should return the decl of a function that implements the
 vectorized variant of the function with the @code{combined_fn} code
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
-@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
-
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
 @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -1037,7 +1037,7 @@ round_push (rtx size)
      TRUNC_DIV_EXPR.  */
   size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
 		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
+  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
 			NULL_RTX, 1);
   size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
 
@@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
 			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
 				       Pmode),
 			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
+  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
 			  gen_int_mode (required_align / BITS_PER_UNIT,
 					Pmode),
 			  NULL_RTX, 1);
diff --git a/gcc/expmed.h b/gcc/expmed.h
index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
--- a/gcc/expmed.h
+++ b/gcc/expmed.h
@@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
 extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
 			       int);
 #ifdef GCC_OPTABS_H
-extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
-			  rtx, rtx, rtx, int,
-			  enum optab_methods = OPTAB_LIB_WIDEN);
+extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
+			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
 #endif
 #endif
 
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
 
 rtx
 expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
-	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
-	       int unsignedp, enum optab_methods methods)
+	       rtx op0, rtx op1, rtx target, int unsignedp,
+	       enum optab_methods methods)
 {
   machine_mode compute_mode;
   rtx tquotient;
@@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 
   last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
 
-  /* Check if the target has specific expansions for the division.  */
-  tree cst;
-  if (treeop0
-      && treeop1
-      && (cst = uniform_integer_cst_p (treeop1))
-      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
-						     wi::to_wide (cst),
-						     &target, op0, op1))
-    return target;
-
-
   /* Now convert to the best mode to use.  */
   if (compute_mode != mode)
     {
@@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 			    || (optab_handler (sdivmod_optab, int_mode)
 				!= CODE_FOR_nothing)))
 		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
-						int_mode, treeop0, treeop1,
-						op0, gen_int_mode (abs_d,
+						int_mode, op0,
+						gen_int_mode (abs_d,
 							      int_mode),
 						NULL_RTX, 0);
 		    else
@@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 				      size - 1, NULL_RTX, 0);
 		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
 				    NULL_RTX);
-		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
-				    treeop1, t3, op1, NULL_RTX, 0);
+		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
+				    NULL_RTX, 0);
 		if (t4)
 		  {
 		    rtx t5;
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
 	    return expand_divmod (0,
 				  FLOAT_MODE_P (GET_MODE (value))
 				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
-				  GET_MODE (value), NULL, NULL, op1, op2,
-				  target, 0);
+				  GET_MODE (value), op1, op2, target, 0);
 	case MOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 0);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 0);
 	case UDIV:
-	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case UMOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case ASHIFTRT:
 	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
 				      target, 0, OPTAB_LIB_WIDEN);
@@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       bool speed_p = optimize_insn_for_speed_p ();
       do_pending_stack_adjust ();
       start_sequence ();
-      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 1);
+      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
       rtx_insn *uns_insns = get_insns ();
       end_sequence ();
       start_sequence ();
-      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 0);
+      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
       rtx_insn *sgn_insns = get_insns ();
       end_sequence ();
       unsigned uns_cost = seq_cost (uns_insns, speed_p);
@@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       emit_insn (sgn_insns);
       return sgn_ret;
     }
-  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
-			op0, op1, target, unsignedp);
+  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
 }
 
 rtx
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
 			      smul_highpart, umul_highpart, binary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
+			      sadd_highpart, uadd_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
 		return NULL_RTX;
 	    }
 	}
-      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
-				     sum, gen_int_mode (INTVAL (op1),
-							word_mode),
+      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
+				     gen_int_mode (INTVAL (op1), word_mode),
 				     NULL_RTX, 1, OPTAB_DIRECT);
       if (remainder == NULL_RTX)
 	return NULL_RTX;
@@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
 
   if (op11 != const1_rtx)
     {
-      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
-				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
+				NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
@@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
-      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
-				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
+				 NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (quot2 == NULL_RTX)
 	return NULL_RTX;
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
 
 OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
 OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
+OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
+OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
 
 OPTAB_D (cmpmem_optab, "cmpmem$a")
 OPTAB_D (cmpstr_optab, "cmpstr$a")
diff --git a/gcc/target.def b/gcc/target.def
index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1905,25 +1905,6 @@ implementation approaches itself.",
 	const vec_perm_indices &sel),
  NULL)
 
-DEFHOOK
-(can_special_div_by_const,
- "This hook is used to test whether the target has a special method of\n\
-division of vectors of type @var{vectype} using the value @var{constant},\n\
-and producing a vector of type @var{vectype}.  The division\n\
-will then not be decomposed by the vectorizer and kept as a div.\n\
-\n\
-When the hook is being used to test whether the target supports a special\n\
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
-is being used to emit a division, @var{in0} and @var{in1} are the source\n\
-vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
-type @var{vectype}.\n\
-\n\
-Return true if the operation is possible, emitting instructions for it\n\
-if rtxes are provided and updating @var{output}.",
- bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
-	rtx in0, rtx in1),
- default_can_special_div_by_const)
-
 /* Return true if the target supports misaligned store/load of a
    specific factor denoted in the third parameter.  The last parameter
    is true if the access is defined in a packed struct.  */
diff --git a/gcc/target.h b/gcc/target.h
index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -51,7 +51,6 @@
 #include "insn-codes.h"
 #include "tm.h"
 #include "hard-reg-set.h"
-#include "tree-core.h"
 
 #if CHECKING_P
 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
 extern rtx default_addr_space_convert (rtx, tree, tree);
 extern unsigned int default_case_values_threshold (void);
 extern bool default_have_conditional_execution (void);
-extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
-					      rtx *, rtx, rtx);
 
 extern bool default_libc_has_function (enum function_class, tree);
 extern bool default_libc_has_fast_function (int fcode);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
   return HAVE_conditional_execution;
 }
 
-/* Default that no division by constant operations are special.  */
-bool
-default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
-				  rtx)
-{
-  return false;
-}
-
 /* By default we assume that c99 functions are present at the runtime,
    but sincos is not.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
 	  tree rhs2 = gimple_assign_rhs2 (assign);
 	  tree ret;
 
-	  /* Check if the target was going to handle it through the special
-	     division callback hook.  */
-	  tree cst = uniform_integer_cst_p (rhs2);
-	  if (cst &&
-	      targetm.vectorize.can_special_div_by_const (code, type,
-							  wi::to_wide (cst),
-							  NULL,
-							  NULL_RTX, NULL_RTX))
-	    return NULL_TREE;
-
-
 	  if (!optimize
 	      || !VECTOR_INTEGER_TYPE_P (type)
 	      || TREE_CODE (rhs2) != VECTOR_CST
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
       return pattern_stmt;
     }
   else if ((cst = uniform_integer_cst_p (oprnd1))
-	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
-							  wi::to_wide (cst),
-							  NULL, NULL_RTX,
-							  NULL_RTX))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
+					      OPTIMIZE_FOR_SPEED))
     {
-      return NULL;
+      /* Division optimization using narrowing:
+       we can do the division of e.g. shorts by 255 faster by calculating it as
+       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
+       double the precision of x.
+
+       If we imagine a short as being composed of two blocks of bytes then
+       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
+       adding 1 to each sub component:
+
+	    short value of 16-bits
+       ┌──────────────┬────────────────┐
+       │              │                │
+       └──────────────┴────────────────┘
+	 8-bit part1 ▲  8-bit part2   ▲
+		     │                │
+		     │                │
+		    +1               +1
+
+       After the first addition, we have to shift right by 8 and narrow the
+       result back to a byte.  Remember that the addition must be done in
+       double the precision of the input.  However, if we know that the
+       addition `x + 257` does not overflow, we can do the operation in the
+       current precision, in which case we don't need the packs and unpacks.  */
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == (int) (element_precision (vectype) / 2))
+	{
+	  wide_int min,max;
+	  /* If we're in a pattern we need to find the original definition.  */
+	  tree op0 = oprnd0;
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
+	  if (is_pattern_stmt_p (stmt_info))
+	    {
+	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
+		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
+	    }
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+	  if (vect_get_range_info (op0, &min, &max))
+	    {
+	      wide_int one = wi::to_wide (build_one_cst (itype));
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      /* We need adder and max in the same precision.  */
+	      wide_int zadder
+		= wide_int_storage::from (adder, wi::get_precision (max),
+					  UNSIGNED);
+	      wi::add (max, zadder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  gcall *patt1
+		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
+		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
+		  gimple_call_set_lhs (patt1, lhs);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  pattern_stmt
+		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
+		  lhs = vect_recog_temp_ssa_var (itype, NULL);
+		  gimple_call_set_lhs (pattern_stmt, lhs);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
     }
 
   if (prec > HOST_BITS_PER_WIDE_INT
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
 	}
       target_support_p = (optab_handler (optab, vec_mode)
 			  != CODE_FOR_nothing);
-      tree cst;
-      if (!target_support_p
-	  && op1
-	  && (cst = uniform_integer_cst_p (op1)))
-	target_support_p
-	  = targetm.vectorize.can_special_div_by_const (code, vectype,
-							wi::to_wide (cst),
-							NULL, NULL_RTX,
-							NULL_RTX);
     }
 
   bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);

Comments

Tamar Christina Feb. 10, 2023, 10:34 a.m. UTC | #1
Oops, I realized I forgot to fill in the test results; there were no issues 😊

Richard Biener Feb. 10, 2023, 1:13 p.m. UTC | #2
On Thu, 9 Feb 2023, Tamar Christina wrote:

> [...]
> --- inline copy of patch -- 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
>  Similar, but the multiplication is unsigned.  This may be represented
>  in RTL using an @code{umul_highpart} RTX expression.
>  
> +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> +@item @samp{sadd@var{m}3_highpart}
> +Perform a signed addition of operands 1 and 2, which have mode
> +@var{m}, and store the most significant half of the product in operand 0.

of the sum

> +The least significant half of the product is discarded.  This may be
> +represented in RTL using a @code{sadd_highpart} RTX expression.

likewise.

> +
> +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> +@item @samp{uadd@var{m}3_highpart}
> +Similar, but the addition is unsigned.  This may be represented
> +in RTL using an @code{uadd_highpart} RTX expression.
> +

is the highpart of the results sign- (for sadd) or zero- (for uadd)
extended to the full precision of the result mode? "store the most
significant half ... in operand 0" leaves that underspecified I think
(likewise for the mul_highpart pattern docs you copied this from).
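
(For illustration only, not from the patch: one natural reading for a
16-bit uadd_highpart would be result = (uint16_t) ((x + y) >> 8),
i.e. the high byte of the sum zero-extended into the result mode; the
documentation should spell out which semantics is intended.)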

Otherwise looks good to me.  Review would have been easier with
splitting the revert from the new implementation ...

Thanks,
Richard.

Richard Sandiford Feb. 10, 2023, 1:36 p.m. UTC | #3
I think I'm misunderstanding, but: it seems like we're treating the
add highpart optabs as companions to the mul highpart optabs.  But AIUI,
the add highpart optab is used such that, for an N-bit mode, we do
an N-bit addition followed by a shift by N/2.  Is that right?
The mul highpart optabs instead do a 2N-bit multiplication followed
by a shift by N.
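
For concreteness, that contrast in a scalar sketch for N = 16 (the
uaddh16/umulh16 names are mine, not from the patch):

  #include <stdint.h>

  /* add highpart as the patch seems to use it: an N-bit addition,
     then a shift by N/2, i.e. the high half of the N-bit sum.  */
  static uint16_t
  uaddh16 (uint16_t x, uint16_t y)
  {
    return (uint16_t) (x + y) >> 8;
  }

  /* mul highpart: a 2N-bit multiplication, then a shift by N,
     i.e. the high half of the 2N-bit product.  */
  static uint16_t
  umulh16 (uint16_t x, uint16_t y)
  {
    return ((uint32_t) x * y) >> 16;
  }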

Apart from consistency, the reason this matters is: I'm not sure what we
gain by adding the optab rather than simply open-coding the addition and
the shift directly into the vector pattern.  It seems like the AArch64
expander in 2/2 does just do an ordinary N-bit addition followed by an
ordinary shift by N/2.
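
Open-coded, each ADDH in the new pattern would just become an add and
a shift, something like this untested sketch (names as in the
tree-vect-patterns.cc hunk quoted below):

  /* Hypothetical replacement for one IFN_ADDH call in
     vect_recog_divmod_pattern: oprnd0 + tadder, then >> pow.  */
  tree sum = vect_recog_temp_ssa_var (itype, NULL);
  gimple *g = gimple_build_assign (sum, PLUS_EXPR, oprnd0, tadder);
  append_pattern_def_seq (vinfo, stmt_vinfo, g, vectype);
  tree hi = vect_recog_temp_ssa_var (itype, NULL);
  pattern_stmt = gimple_build_assign (hi, RSHIFT_EXPR, sum,
                                      build_int_cst (itype, pow));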

Some comments in addition to Richard's:

Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
>  Similar, but the multiplication is unsigned.  This may be represented
>  in RTL using an @code{umul_highpart} RTX expression.
>  
> +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> +@item @samp{smul@var{m}3_highpart}

sadd

> +Perform a signed addition of operands 1 and 2, which have mode
> +@var{m}, and store the most significant half of the product in operand 0.
> +The least significant half of the product is discarded.  This may be
> +represented in RTL using a @code{sadd_highpart} RTX expression.
> +
> +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> +@item @samp{uadd@var{m}3_highpart}
> +Similar, but the addition is unsigned.  This may be represented
> +in RTL using an @code{uadd_highpart} RTX expression.
> +
>  @cindex @code{madd@var{m}@var{n}4} instruction pattern
>  @item @samp{madd@var{m}@var{n}4}
>  Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
>  of a signed multiplication, @code{umul_highpart} returns the high part
>  of an unsigned multiplication.
>  
> +@findex sadd_highpart
> +@findex uadd_highpart
> +@cindex high-part addition
> +@cindex addition high part
> +@item (sadd_highpart:@var{m} @var{x} @var{y})
> +@itemx (uadd_highpart:@var{m} @var{x} @var{y})
> +Represents the high-part addition of @var{x} and @var{y} carried
> +out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
> +of a signed addition, @code{uadd_highpart} returns the high part
> +of an unsigned addition.

The patch doesn't add these RTL codes though.
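Adding them would mean rtl.def entries parallel to the existing
mul_highpart ones, i.e. something like this sketch (assuming the
codes are wanted at all):

  DEF_RTL_EXPR(SADD_HIGHPART, "sadd_highpart", "ee", RTX_COMM_ARITH)
  DEF_RTL_EXPR(UADD_HIGHPART, "uadd_highpart", "ee", RTX_COMM_ARITH)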

> [...]
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>        return pattern_stmt;
>      }
>    else if ((cst = uniform_integer_cst_p (oprnd1))
> -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> -							  wi::to_wide (cst),
> -							  NULL, NULL_RTX,
> -							  NULL_RTX))
> +	   && TYPE_UNSIGNED (itype)
> +	   && rhs_code == TRUNC_DIV_EXPR
> +	   && vectype
> +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> +					      OPTIMIZE_FOR_SPEED))
>      {
> -      return NULL;
> +      /* div optimizations using narrowings
> +       we can do the division e.g. shorts by 255 faster by calculating it as
> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> +       double the precision of x.
> +
> +       If we imagine a short as being composed of two blocks of bytes then
> +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> +       adding 1 to each sub component:
> +
> +	    short value of 16-bits
> +       ┌──────────────┬────────────────┐
> +       │              │                │
> +       └──────────────┴────────────────┘
> +	 8-bit part1 ▲  8-bit part2   ▲
> +		     │                │
> +		     │                │
> +		    +1               +1
> +
> +       after the first addition, we have to shift right by 8, and narrow the
> +       results back to a byte.  Remember that the addition must be done in
> +       double the precision of the input.  However if we know that the addition
> +       `x + 257` does not overflow then we can do the operation in the current
> +       precision.  In which case we don't need the pack and unpacks.  */
> +      auto wcst = wi::to_wide (cst);
> +      int pow = wi::exact_log2 (wcst + 1);
> +      if (pow == (int) (element_precision (vectype) / 2))
> +	{
> +	  wide_int min,max;
> +	  /* If we're in a pattern we need to find the orginal definition.  */
> +	  tree op0 = oprnd0;
> +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> +	  if (is_pattern_stmt_p (stmt_info))
> +	    {
> +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> +	    }

If this is generally safe (I'm skipping thinking about it in the
interests of a quick review :-)), then I think it should be done in
vect_get_range_info instead.  Using gimple_get_lhs would be more
general than handling just assignments.
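E.g. something like this untested sketch:

  /* Hypothetically inside vect_get_range_info: look through pattern
     statements to the original definition; gimple_get_lhs also
     covers calls, not just assignments.  */
  if (stmt_info && is_pattern_stmt_p (stmt_info))
    {
      stmt_vec_info orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
      if (tree lhs = gimple_get_lhs (STMT_VINFO_STMT (orig_stmt)))
        op0 = lhs;
    }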

> +
> +	  /* Check that no overflow will occur.  If we don't have range
> +	     information we can't perform the optimization.  */
> +	  if (vect_get_range_info (op0, &min, &max))
> +	    {
> +	      wide_int one = wi::to_wide (build_one_cst (itype));
> +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> +	      wi::overflow_type ovf;
> +	      /* We need adder and max in the same precision.  */
> +	      wide_int zadder
> +		= wide_int_storage::from (adder, wi::get_precision (max),
> +					  UNSIGNED);
> +	      wi::add (max, zadder, UNSIGNED, &ovf);

Could you explain this a bit more?  When do we have mismatched precisions?

Thanks,
Richard

> +	      if (ovf == wi::OVF_NONE)
> +		{
> +		  *type_out = vectype;
> +		  tree tadder = wide_int_to_tree (itype, adder);
> +		  gcall *patt1
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (patt1, lhs);
> +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> +
> +		  pattern_stmt
> +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> +		  gimple_call_set_lhs (pattern_stmt, lhs);
> +
> +		  return pattern_stmt;
> +		}
> +	    }
> +	}
>      }
>  
>    if (prec > HOST_BITS_PER_WIDE_INT
> [...]
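
As a sanity check on the identity in the quoted comment, a scalar
version (not part of the patch; the arithmetic is done in 32 bits,
so the additions cannot overflow):

  #include <stdint.h>
  #include <assert.h>

  int
  main (void)
  {
    /* (x + ((x + 257) >> 8)) >> 8 == x / 255 for every 16-bit x when
       the operation is carried out in double the precision.  */
    for (uint32_t x = 0; x <= 0xffff; x++)
      assert (((x + ((x + 257) >> 8)) >> 8) == x / 255);
    return 0;
  }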
Richard Biener Feb. 10, 2023, 1:52 p.m. UTC | #4
On Fri, 10 Feb 2023, Richard Sandiford wrote:

> I think I'm misunderstanding, but: it seems like we're treating the
> add highpart optabs as companions to the mul highpart optabs.  But AIUI,
> the add highpart optab is used such that, for an N-bit mode, we do
> an N-bit addition followed by a shift by N/2.  Is that right?
> The mul highpart optabs instead do an 2N-bit multiplication followed
> by a shift by N.

That also confused me - and the docs add to the confusion more than
they clear it up ...  I agree we should be consistent in the semantics
for add_highpart and mul_highpart.

> Apart from consistency, the reason this matters is: I'm not sure what we
> gain by adding the optab rather than simply open-coding the addition and
> the shift directly into the vector pattern.  It seems like the AArch64
> expander in 2/2 does just do an ordinary N-bit addition followed by an
> ordinary shift by N/2.
> 
> Some comments in addition to Richard's:
> 
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi All,
> >
> > As discussed in the ticket, this replaces the approach for optimizing the
> > div by bitmask operation from a hook into optabs implemented through
> > add_highpart.
> >
> > In order to be able to use this we need to check whether the current precision
> > has enough bits to do the operation without any of the additions overflowing.
> >
> > We use range information to determine this and only do the operation if we're
> > sure am overflow won't occur.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR target/108583
> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
> > 	* doc/tm.texi.in: Likewise.
> > 	* explow.cc (round_push, align_dynamic_address): Revert previous patch.
> > 	* expmed.cc (expand_divmod): Likewise.
> > 	* expmed.h (expand_divmod): Likewise.
> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> > 	* optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
> > 	* internal-fn.def (ADDH): New.
> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> > 	* doc/md.texi: Document them.
> > 	* doc/rtl.texi: Likewise.
> > 	* target.def (can_special_div_by_const): Remove.
> > 	* target.h: Remove tree-core.h include
> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
> > 	implement new obtab recognition based on range.
> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR target/108583
> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >
> > --- inline copy of patch -- 
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -5668,6 +5668,18 @@ represented in RTL using a @code{smul_highpart} RTX expression.
> >  Similar, but the multiplication is unsigned.  This may be represented
> >  in RTL using an @code{umul_highpart} RTX expression.
> >  
> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern
> > +@item @samp{smul@var{m}3_highpart}
> 
> sadd
> 
> > +Perform a signed addition of operands 1 and 2, which have mode
> > +@var{m}, and store the most significant half of the product in operand 0.
> > +The least significant half of the product is discarded.  This may be
> > +represented in RTL using a @code{sadd_highpart} RTX expression.
> > +
> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern
> > +@item @samp{uadd@var{m}3_highpart}
> > +Similar, but the addition is unsigned.  This may be represented
> > +in RTL using an @code{uadd_highpart} RTX expression.
> > +
> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern
> >  @item @samp{madd@var{m}@var{n}4}
> >  Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> > diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> > index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> > --- a/gcc/doc/rtl.texi
> > +++ b/gcc/doc/rtl.texi
> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.  @code{smul_highpart} returns the high part
> >  of a signed multiplication, @code{umul_highpart} returns the high part
> >  of an unsigned multiplication.
> >  
> > +@findex sadd_highpart
> > +@findex uadd_highpart
> > +@cindex high-part addition
> > +@cindex addition high part
> > +@item (sadd_highpart:@var{m} @var{x} @var{y})
> > +@itemx (uadd_highpart:@var{m} @var{x} @var{y})
> > +Represents the high-part addition of @var{x} and @var{y} carried
> > +out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
> > +of a signed addition, @code{uadd_highpart} returns the high part
> > +of an unsigned addition.
> 
> The patch doesn't add these RTL codes though.
> 
> > +
> >  @findex fma
> >  @cindex fused multiply-add
> >  @item (fma:@var{m} @var{x} @var{y} @var{z})
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the hook to handle these two
> >  implementation approaches itself.
> >  @end deftypefn
> >  
> > -@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
> > -This hook is used to test whether the target has a special method of
> > -division of vectors of type @var{vectype} using the value @var{constant},
> > -and producing a vector of type @var{vectype}.  The division
> > -will then not be decomposed by the vectorizer and kept as a div.
> > -
> > -When the hook is being used to test whether the target supports a special
> > -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
> > -is being used to emit a division, @var{in0} and @var{in1} are the source
> > -vectors of type @var{vecttype} and @var{output} is the destination vector of
> > -type @var{vectype}.
> > -
> > -Return true if the operation is possible, emitting instructions for it
> > -if rtxes are provided and updating @var{output}.
> > -@end deftypefn
> > -
> >  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
> >  This hook should return the decl of a function that implements the
> >  vectorized variant of the function with the @code{combined_fn} code
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent strategy can generate better code.
> >  
> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >  
> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> > -
> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >  
> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> > diff --git a/gcc/explow.cc b/gcc/explow.cc
> > index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> > --- a/gcc/explow.cc
> > +++ b/gcc/explow.cc
> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >       TRUNC_DIV_EXPR.  */
> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >  			NULL_RTX, 1);
> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >  
> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned required_align)
> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >  				       Pmode),
> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >  					Pmode),
> >  			  NULL_RTX, 1);
> > diff --git a/gcc/expmed.h b/gcc/expmed.h
> > index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> > --- a/gcc/expmed.h
> > +++ b/gcc/expmed.h
> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
> >  extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
> >  			       int);
> >  #ifdef GCC_OPTABS_H
> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
> > -			  rtx, rtx, rtx, int,
> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
> > +			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
> >  #endif
> >  #endif
> >  
> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> > index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> > --- a/gcc/expmed.cc
> > +++ b/gcc/expmed.cc
> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
> >  
> >  rtx
> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> > -	       int unsignedp, enum optab_methods methods)
> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> > +	       enum optab_methods methods)
> >  {
> >    machine_mode compute_mode;
> >    rtx tquotient;
> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> >  
> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
> >  
> > -  /* Check if the target has specific expansions for the division.  */
> > -  tree cst;
> > -  if (treeop0
> > -      && treeop1
> > -      && (cst = uniform_integer_cst_p (treeop1))
> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
> > -						     wi::to_wide (cst),
> > -						     &target, op0, op1))
> > -    return target;
> > -
> > -
> >    /* Now convert to the best mode to use.  */
> >    if (compute_mode != mode)
> >      {
> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >  				!= CODE_FOR_nothing)))
> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> > -						int_mode, treeop0, treeop1,
> > -						op0, gen_int_mode (abs_d,
> > +						int_mode, op0,
> > +						gen_int_mode (abs_d,
> >  							      int_mode),
> >  						NULL_RTX, 0);
> >  		    else
> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
> >  				      size - 1, NULL_RTX, 0);
> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >  				    NULL_RTX);
> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
> > -				    treeop1, t3, op1, NULL_RTX, 0);
> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
> > +				    NULL_RTX, 0);
> >  		if (t4)
> >  		  {
> >  		    rtx t5;
> > diff --git a/gcc/expr.cc b/gcc/expr.cc
> > index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
> > --- a/gcc/expr.cc
> > +++ b/gcc/expr.cc
> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >  	    return expand_divmod (0,
> >  				  FLOAT_MODE_P (GET_MODE (value))
> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> > -				  target, 0);
> > +				  GET_MODE (value), op1, op2, target, 0);
> >  	case MOD:
> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> > -				op1, op2, target, 0);
> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> > +				target, 0);
> >  	case UDIV:
> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
> > -				op1, op2, target, 1);
> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
> > +				target, 1);
> >  	case UMOD:
> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
> > -				op1, op2, target, 1);
> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
> > +				target, 1);
> >  	case ASHIFTRT:
> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >  				      target, 0, OPTAB_LIB_WIDEN);
> > @@ -9170,13 +9169,11 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
> >        bool speed_p = optimize_insn_for_speed_p ();
> >        do_pending_stack_adjust ();
> >        start_sequence ();
> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -				   op0, op1, target, 1);
> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
> >        rtx_insn *uns_insns = get_insns ();
> >        end_sequence ();
> >        start_sequence ();
> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -				   op0, op1, target, 0);
> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
> >        rtx_insn *sgn_insns = get_insns ();
> >        end_sequence ();
> >        unsigned uns_cost = seq_cost (uns_insns, speed_p);
> > @@ -9198,8 +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
> >        emit_insn (sgn_insns);
> >        return sgn_ret;
> >      }
> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> > -			op0, op1, target, unsignedp);
> > +  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
> >  }
> >  
> >  rtx
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
> >  
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
> >  			      smul_highpart, umul_highpart, binary)
> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
> > +			      sadd_highpart, uadd_highpart, binary)
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
> >  			      smulhs, umulhs, binary)
> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> > index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
> > --- a/gcc/optabs.cc
> > +++ b/gcc/optabs.cc
> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
> >  		return NULL_RTX;
> >  	    }
> >  	}
> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
> > -				     sum, gen_int_mode (INTVAL (op1),
> > -							word_mode),
> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
> > +				     gen_int_mode (INTVAL (op1), word_mode),
> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >        if (remainder == NULL_RTX)
> >  	return NULL_RTX;
> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
> >  
> >    if (op11 != const1_rtx)
> >      {
> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
> > -				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >        if (rem2 == NULL_RTX)
> >  	return NULL_RTX;
> >  
> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
> >        if (rem2 == NULL_RTX)
> >  	return NULL_RTX;
> >  
> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
> > -				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >        if (quot2 == NULL_RTX)
> >  	return NULL_RTX;
> >  
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >  
> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
> >  OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> > +OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
> >  
> >  OPTAB_D (cmpmem_optab, "cmpmem$a")
> >  OPTAB_D (cmpstr_optab, "cmpstr$a")
> > diff --git a/gcc/target.def b/gcc/target.def
> > index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >  	const vec_perm_indices &sel),
> >   NULL)
> >  
> > -DEFHOOK
> > -(can_special_div_by_const,
> > - "This hook is used to test whether the target has a special method of\n\
> > -division of vectors of type @var{vectype} using the value @var{constant},\n\
> > -and producing a vector of type @var{vectype}.  The division\n\
> > -will then not be decomposed by the vectorizer and kept as a div.\n\
> > -\n\
> > -When the hook is being used to test whether the target supports a special\n\
> > -divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
> > -is being used to emit a division, @var{in0} and @var{in1} are the source\n\
> > -vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
> > -type @var{vectype}.\n\
> > -\n\
> > -Return true if the operation is possible, emitting instructions for it\n\
> > -if rtxes are provided and updating @var{output}.",
> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> > -	rtx in0, rtx in1),
> > - default_can_special_div_by_const)
> > -
> >  /* Return true if the target supports misaligned store/load of a
> >     specific factor denoted in the third parameter.  The last parameter
> >     is true if the access is defined in a packed struct.  */
> > diff --git a/gcc/target.h b/gcc/target.h
> > index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
> > --- a/gcc/target.h
> > +++ b/gcc/target.h
> > @@ -51,7 +51,6 @@
> >  #include "insn-codes.h"
> >  #include "tm.h"
> >  #include "hard-reg-set.h"
> > -#include "tree-core.h"
> >  
> >  #if CHECKING_P
> >  
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
> >  extern rtx default_addr_space_convert (rtx, tree, tree);
> >  extern unsigned int default_case_values_threshold (void);
> >  extern bool default_have_conditional_execution (void);
> > -extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
> > -					      rtx *, rtx, rtx);
> >  
> >  extern bool default_libc_has_function (enum function_class, tree);
> >  extern bool default_libc_has_fast_function (int fcode);
> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
> > --- a/gcc/targhooks.cc
> > +++ b/gcc/targhooks.cc
> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >    return HAVE_conditional_execution;
> >  }
> >  
> > -/* Default that no division by constant operations are special.  */
> > -bool
> > -default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
> > -				  rtx)
> > -{
> > -  return false;
> > -}
> > -
> >  /* By default we assume that c99 functions are present at the runtime,
> >     but sincos is not.  */
> >  bool
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include <stdint.h>
> > +#include "tree-vect.h"
> > +
> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> > +
> > +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
> > +foo (V v, unsigned short i)
> > +{
> > +  v /= i;
> > +  return v;
> > +}
> > +
> > +int
> > +main (void)
> > +{
> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> > +    if (v[i] != 0x00010001)
> > +      __builtin_abort ();
> > +  return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> > @@ -0,0 +1,58 @@
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include "tree-vect.h"
> > +
> > +#define N 50
> > +#define TYPE uint8_t 
> > +
> > +#ifndef DEBUG
> > +#define DEBUG 0
> > +#endif
> > +
> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > +
> > +
> > +__attribute__((noipa, noinline, optimize("O1")))
> > +void fun1(TYPE* restrict pixel, TYPE level, int n)
> > +{
> > +  for (int i = 0; i < n; i+=1)
> > +    pixel[i] = (pixel[i] + level) / 0xff;
> > +}
> > +
> > +__attribute__((noipa, noinline, optimize("O3")))
> > +void fun2(TYPE* restrict pixel, TYPE level, int n)
> > +{
> > +  for (int i = 0; i < n; i+=1)
> > +    pixel[i] = (pixel[i] + level) / 0xff;
> > +}
> > +
> > +int main ()
> > +{
> > +  TYPE a[N];
> > +  TYPE b[N];
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE + i * 13;
> > +      b[i] = BASE + i * 13;
> > +      if (DEBUG)
> > +        printf ("%d: 0x%x\n", i, a[i]);
> > +    }
> > +
> > +  fun1 (a, N / 2, N);
> > +  fun2 (b, N / 2, N);
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      if (DEBUG)
> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > +
> > +      if (a[i] != b[i])
> > +        __builtin_abort ();
> > +    }
> > +  return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> > index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> > --- a/gcc/tree-vect-generic.cc
> > +++ b/gcc/tree-vect-generic.cc
> > @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >  	  tree ret;
> >  
> > -	  /* Check if the target was going to handle it through the special
> > -	     division callback hook.  */
> > -	  tree cst = uniform_integer_cst_p (rhs2);
> > -	  if (cst &&
> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> > -							  wi::to_wide (cst),
> > -							  NULL,
> > -							  NULL_RTX, NULL_RTX))
> > -	    return NULL_TREE;
> > -
> > -
> >  	  if (!optimize
> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >  	      || TREE_CODE (rhs2) != VECTOR_CST
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> >        return pattern_stmt;
> >      }
> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> > -							  wi::to_wide (cst),
> > -							  NULL, NULL_RTX,
> > -							  NULL_RTX))
> > +	   && TYPE_UNSIGNED (itype)
> > +	   && rhs_code == TRUNC_DIV_EXPR
> > +	   && vectype
> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> > +					      OPTIMIZE_FOR_SPEED))
> >      {
> > -      return NULL;
> > +      /* div optimizations using narrowings
> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> > +       double the precision of x.
> > +
> > +       If we imagine a short as being composed of two blocks of bytes then
> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> > +       adding 1 to each sub component:
> > +
> > +	    short value of 16-bits
> > +       ┌──────────────┬────────────────┐
> > +       │              │                │
> > +       └──────────────┴────────────────┘
> > +	 8-bit part1 ▲  8-bit part2   ▲
> > +		     │                │
> > +		     │                │
> > +		    +1               +1
> > +
> > +       after the first addition, we have to shift right by 8, and narrow the
> > +       results back to a byte.  Remember that the addition must be done in
> > +       double the precision of the input.  However if we know that the addition
> > +       `x + 257` does not overflow then we can do the operation in the current
> > +       precision.  In which case we don't need the pack and unpacks.  */
> > +      auto wcst = wi::to_wide (cst);
> > +      int pow = wi::exact_log2 (wcst + 1);
> > +      if (pow == (int) (element_precision (vectype) / 2))
> > +	{
> > +	  wide_int min,max;
> > +	  /* If we're in a pattern we need to find the orginal definition.  */
> > +	  tree op0 = oprnd0;
> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> > +	  if (is_pattern_stmt_p (stmt_info))
> > +	    {
> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> > +	    }
> 
> If this is generally safe (I'm skipping thinking about it in the
> interests of a quick review :-)), then I think it should be done in
> vect_get_range_info instead.  Using gimple_get_lhs would be more
> general than handling just assignments.
> 
> > +
> > +	  /* Check that no overflow will occur.  If we don't have range
> > +	     information we can't perform the optimization.  */
> > +	  if (vect_get_range_info (op0, &min, &max))
> > +	    {
> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> > +	      wi::overflow_type ovf;
> > +	      /* We need adder and max in the same precision.  */
> > +	      wide_int zadder
> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> > +					  UNSIGNED);
> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> 
> Could you explain this a bit more?  When do we have mismatched precisions?
> 
> Thanks,
> Richard
> 
> > +	      if (ovf == wi::OVF_NONE)
> > +		{
> > +		  *type_out = vectype;
> > +		  tree tadder = wide_int_to_tree (itype, adder);
> > +		  gcall *patt1
> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> > +		  gimple_call_set_lhs (patt1, lhs);
> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> > +
> > +		  pattern_stmt
> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> > +
> > +		  return pattern_stmt;
> > +		}
> > +	    }
> > +	}
> >      }
> >  
> >    if (prec > HOST_BITS_PER_WIDE_INT
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >  	}
> >        target_support_p = (optab_handler (optab, vec_mode)
> >  			  != CODE_FOR_nothing);
> > -      tree cst;
> > -      if (!target_support_p
> > -	  && op1
> > -	  && (cst = uniform_integer_cst_p (op1)))
> > -	target_support_p
> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> > -							wi::to_wide (cst),
> > -							NULL, NULL_RTX,
> > -							NULL_RTX);
> >      }
> >  
> >    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
>
Tamar Christina Feb. 10, 2023, 2:13 p.m. UTC | #5
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 1:36 PM
> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> I think I'm misunderstanding, but: it seems like we're treating the add
> highpart optabs as companions to the mul highpart optabs.  But AIUI, the add
> highpart optab is used such that, for an N-bit mode, we do an N-bit addition
> followed by a shift by N/2.  Is that right?
> The mul highpart optabs instead do a 2N-bit multiplication followed by a
> shift by N.

Correct.
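
That is (an illustrative rendering for N == 16; the function names are made
up for the example, not from the patch):

#include <stdint.h>

static uint16_t
uadd_highpart_16 (uint16_t x, uint16_t y)
{
  return (uint16_t) (x + y) >> 8;	/* N-bit add, then shift by N/2.  */
}

static uint16_t
umul_highpart_16 (uint16_t x, uint16_t y)
{
  return ((uint32_t) x * y) >> 16;	/* 2N-bit mul, then shift by N.  */
}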

> 
> Apart from consistency, the reason this matters is: I'm not sure what we gain
> by adding the optab rather than simply open-coding the addition and the
> shift directly into the vector pattern.  It seems like the AArch64 expander in
> 2/2 does just do an ordinary N-bit addition followed by an ordinary shift by
> N/2.

I mentioned it in the implementation, but I did so because AArch64 has various
optimizations on shifts when it comes to truncating results.  I didn't need to
represent it with shifts; in fact the original pattern did not.  But representing it
directly in the final instructions is problematic because those instructions are
unspecs, and I would have needed to provide additional optabs to optimize them.

So the shift representation was more natural for AArch64.  It would not be, say, for
AArch32, which does not have these optimizations.  SVE has similar optimizations,
and at the very worst you get a USRA.

I avoided open-coding it with an add and a shift because that creates a dependency
chain of four instructions (including shifts, which are typically slow) instead of a
load and a multiply.  This change, unless the target is known to optimize it further,
is unlikely to be beneficial.  And by the time we get to costing, the only alternative
is to undo the existing pattern, so you lose the general shift optimization.

So it seemed unwise to open-code it as shifts, given that the codegen out of the
vectorizer would be degenerate for most targets, or one would need the more
complicated route of costing during pattern matching.
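
For reference, a minimal scalar sketch of the identity the pattern uses
(divisor 0xff, 16-bit unsigned element; illustrative only, not the vector
code itself):

#include <stdint.h>
#include <assert.h>

/* C's integer promotion supplies the "double precision" here; the point of
   the pattern is to stay at 16 bits when range information proves that
   x + 257 cannot overflow.  */
static uint16_t
div255 (uint16_t x)
{
  return (x + ((x + 257) >> 8)) >> 8;
}

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    assert (div255 ((uint16_t) x) == x / 255);
  return 0;
}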

> 
> Some comments in addition to Richard's:
> 
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > [...]
> > +
> > +	  /* Check that no overflow will occur.  If we don't have range
> > +	     information we can't perform the optimization.  */
> > +	  if (vect_get_range_info (op0, &min, &max))
> > +	    {
> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> > +	      wi::overflow_type ovf;
> > +	      /* We need adder and max in the same precision.  */
> > +	      wide_int zadder
> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> > +					  UNSIGNED);
> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> 
> Could you explain this a bit more?  When do we have mismatched
> precisions?

C promotion rules will promote e.g.

void fun2(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < n; i+=1)
    pixel[i] = (pixel[i] + level) / 0xff;
}

and have the addition done as a 32-bit integer.  The vectorizer will demote this down
to a short, but range information is not stored for patterns.  So in the above, the range
will correctly be 0x1fe but the precision will be that of the original expression, i.e. 32.
This is a mismatch with itype, which is derived from the size in which the vectorizer
will perform the operation.
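
To put numbers on that (a plain-integer rendition of the check, illustrative
only, with the wide_int machinery replaced by ordinary integers):

#include <stdint.h>
#include <assert.h>

int
main (void)
{
  /* Range of pixel[i] + level after C promotion: at most
     0xff + 0xff == 0x1fe, tracked at the original 32-bit precision.  */
  uint32_t max = 0x1fe;
  /* itype is 16 bits, so pow == 8 and adder == (1 << 8) + 1 == 0x101.  */
  uint32_t adder = 0x101;
  /* 0x1fe + 0x101 == 0x2ff still fits in 16 bits, so the addition cannot
     overflow at itype's precision and the pattern can emit IFN_ADDH.  */
  assert (max + adder <= UINT16_MAX);
  return 0;
}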

Thanks,
Tamar

Richard Sandiford Feb. 10, 2023, 2:30 p.m. UTC | #6
Tamar Christina <Tamar.Christina@arm.com> writes:
>> 
>> I think I'm misunderstanding, but: it seems like we're treating the add
>> highpart optabs as companions to the mul highpart optabs.  But AIUI, the add
>> highpart optab is used such that, for an N-bit mode, we do an N-bit addition
>> followed by a shift by N/2.  Is that right?
>> The mul highpart optabs instead do a 2N-bit multiplication followed by a
>> shift by N.
>
> Correct.
>
>> 
>> Apart from consistency, the reason this matters is: I'm not sure what we gain
>> by adding the optab rather than simply open-coding the addition and the
>> shift directly into the vector pattern.  It seems like the AArch64 expander in
>> 2/2 does just do an ordinary N-bit addition followed by an ordinary shift by
>> N/2.
>
> I mentioned it in the implementation, but I did so because AArch64 has various
> optimizations on shifts when it comes to truncating results.  I didn't need to
> represent it with shifts; in fact the original pattern did not.  But representing it
> directly in the final instructions is problematic because those instructions are
> unspecs, and I would have needed to provide additional optabs to optimize them.
>
> So the shift representation was more natural for AArch64.  It would not be, say, for
> AArch32, which does not have these optimizations.  SVE has similar optimizations,
> and at the very worst you get a USRA.
>
> I avoided open-coding it with an add and a shift because that creates a dependency
> chain of four instructions (including shifts, which are typically slow) instead of a
> load and a multiply.  This change, unless the target is known to optimize it further,
> is unlikely to be beneficial.  And by the time we get to costing, the only alternative
> is to undo the existing pattern, so you lose the general shift optimization.
>
> So it seemed unwise to open-code it as shifts, given that the codegen out of the
> vectorizer would be degenerate for most targets, or one would need the more
> complicated route of costing during pattern matching.

Hmm, OK.  That seems like a cost-model thing though, rather than
something that should be exposed through optabs.  And I imagine
the open-coded version would still be better than nothing on
targets without highpart multiply.

So how about replacing the hook with one that simply asks whether
division through highpart multiplication is preferred over the
add/shift sequence?  (Unfortunately it's not going to be possible
to work that out from existing information.)
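
A minimal sketch of what such a query-only hook could look like in
target.def -- the name, documentation, and NULL default are illustrative
assumptions, not a committed interface:

/* Illustrative only: unlike the removed can_special_div_by_const hook,
   this one emits nothing itself; the vectorizer merely asks which
   expansion to prefer, and falls back to the generic add/shift
   sequence when the hook is NULL or returns false.  */
DEFHOOK
(preferred_div_as_highpart_p,
 "Return true if dividing vectors of type @var{vectype} by the uniform\n\
constant @var{constant} is better implemented through the highpart\n\
multiplication sequence than through the generic add/shift sequence.",
 bool, (tree vectype, wide_int constant),
 NULL)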

Thanks,
Richard

>
>> 
>> Some comments in addition to Richard's:
>> 
>> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Hi All,
>> >
>> > As discussed in the ticket, this replaces the approach for optimizing
>> > the div by bitmask operation from a hook into optabs implemented
>> > through add_highpart.
>> >
>> > In order to be able to use this we need to check whether the current
>> > precision has enough bits to do the operation without any of the additions
>> overflowing.
>> >
>> > We use range information to determine this and only do the operation
>> > if we're sure am overflow won't occur.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
>> issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> > 	PR target/108583
>> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
>> Remove.
>> > 	* doc/tm.texi.in: Likewise.
>> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
>> patch.
>> > 	* expmed.cc (expand_divmod): Likewise.
>> > 	* expmed.h (expand_divmod): Likewise.
>> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
>> > 	* optabs.cc (expand_doubleword_mod,
>> expand_doubleword_divmod): Likewise.
>> > 	* internal-fn.def (ADDH): New.
>> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
>> > 	* doc/md.texi: Document them.
>> > 	* doc/rtl.texi: Likewise.
>> > 	* target.def (can_special_div_by_const): Remove.
>> > 	* target.h: Remove tree-core.h include
>> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
>> > 	* targhooks.h (default_can_special_div_by_const): Remove.
>> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
>> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
>> and
>> > 	implement new obtab recognition based on range.
>> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 	PR target/108583
>> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
>> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
>> >
>> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f74080
>> 3
>> > 8595e21af35d 100644
>> > --- a/gcc/doc/md.texi
>> > +++ b/gcc/doc/md.texi
>> > @@ -5668,6 +5668,18 @@ represented in RTL using a
>> @code{smul_highpart} RTX expression.
>> >  Similar, but the multiplication is unsigned.  This may be represented
>> > in RTL using an @code{umul_highpart} RTX expression.
>> >
>> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
>> > +@samp{smul@var{m}3_highpart}
>> 
>> sadd
>> 
>> > +Perform a signed addition of operands 1 and 2, which have mode
>> > +@var{m}, and store the most significant half of the product in operand 0.
>> > +The least significant half of the product is discarded.  This may be
>> > +represented in RTL using a @code{sadd_highpart} RTX expression.
>> > +
>> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
>> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
>> > +This may be represented in RTL using an @code{uadd_highpart} RTX
>> > +expression.
>> > +
>> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
>> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-extend
>> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
>> > b/gcc/doc/rtl.texi index
>> >
>> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343
>> d17
>> > 1940ec4222f3 100644
>> > --- a/gcc/doc/rtl.texi
>> > +++ b/gcc/doc/rtl.texi
>> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
>> > @code{smul_highpart} returns the high part  of a signed
>> > multiplication, @code{umul_highpart} returns the high part  of an unsigned
>> multiplication.
>> >
>> > +@findex sadd_highpart
>> > +@findex uadd_highpart
>> > +@cindex high-part addition
>> > +@cindex addition high part
>> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
>> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
>> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
>> > +@code{sadd_highpart} returns the high part of a signed addition,
>> > +@code{uadd_highpart} returns the high part of an unsigned addition.
>> 
>> The patch doesn't add these RTL codes though.
>> 
>> > +
>> >  @findex fma
>> >  @cindex fused multiply-add
>> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
>> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
>> >
>> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57914840
>> 17e
>> > 6b0d62ab077e 100644
>> > --- a/gcc/doc/tm.texi
>> > +++ b/gcc/doc/tm.texi
>> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for the
>> > hook to handle these two  implementation approaches itself.
>> >  @end deftypefn
>> >
>> > -@deftypefn {Target Hook} bool
>> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
>> @var{tree_code}, tree
>> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
>> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
>> > target has a special method of -division of vectors of type @var{vectype}
>> using the value @var{constant}, -and producing a vector of type
>> @var{vectype}.  The division -will then not be decomposed by the vectorizer
>> and kept as a div.
>> > -
>> > -When the hook is being used to test whether the target supports a
>> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
>> > When the hook -is being used to emit a division, @var{in0} and
>> > @var{in1} are the source -vectors of type @var{vecttype} and
>> > @var{output} is the destination vector of -type @var{vectype}.
>> > -
>> > -Return true if the operation is possible, emitting instructions for
>> > it -if rtxes are provided and updating @var{output}.
>> > -@end deftypefn
>> > -
>> >  @deftypefn {Target Hook} tree
>> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
>> @var{code},
>> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
>> > return the decl of a function that implements the  vectorized variant
>> > of the function with the @code{combined_fn} code diff --git
>> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
>> >
>> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0
>> a3a
>> > bccd1c293c7b 100644
>> > --- a/gcc/doc/tm.texi.in
>> > +++ b/gcc/doc/tm.texi.in
>> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
>> strategy can generate better code.
>> >
>> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>> >
>> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
>> > -
>> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>> >
>> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
>> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
>> >
>> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0
>> bef
>> > a016eea4573c 100644
>> > --- a/gcc/explow.cc
>> > +++ b/gcc/explow.cc
>> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
>> >       TRUNC_DIV_EXPR.  */
>> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size,
>> > align_rtx,
>> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>> >  			NULL_RTX, 1);
>> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>> >
>> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
>> required_align)
>> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>> >  				       Pmode),
>> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
>> > target,
>> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
>> >  					Pmode),
>> >  			  NULL_RTX, 1);
>> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
>> >
>> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c5364
>> 094
>> > 1628068f3901 100644
>> > --- a/gcc/expmed.h
>> > +++ b/gcc/expmed.h
>> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
>> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
>> (enum tree_code, machine_mode, rtx, int, rtx,
>> >  			       int);
>> >  #ifdef GCC_OPTABS_H
>> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
>> tree,
>> > -			  rtx, rtx, rtx, int,
>> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
>> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
>> rtx,
>> > +			  rtx, int, enum optab_methods =
>> OPTAB_LIB_WIDEN);
>> >  #endif
>> >  #endif
>> >
>> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
>> >
>> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3
>> a59
>> > c169d3b7692f 100644
>> > --- a/gcc/expmed.cc
>> > +++ b/gcc/expmed.cc
>> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode, rtx
>> op0,
>> > HOST_WIDE_INT d)
>> >
>> >  rtx
>> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
>> mode,
>> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
>> > -	       int unsignedp, enum optab_methods methods)
>> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
>> > +	       enum optab_methods methods)
>> >  {
>> >    machine_mode compute_mode;
>> >    rtx tquotient;
>> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum tree_code
>> > code, machine_mode mode,
>> >
>> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
>> >
>> > -  /* Check if the target has specific expansions for the division.
>> > */
>> > -  tree cst;
>> > -  if (treeop0
>> > -      && treeop1
>> > -      && (cst = uniform_integer_cst_p (treeop1))
>> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
>> (treeop0),
>> > -						     wi::to_wide (cst),
>> > -						     &target, op0, op1))
>> > -    return target;
>> > -
>> > -
>> >    /* Now convert to the best mode to use.  */
>> >    if (compute_mode != mode)
>> >      {
>> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum tree_code
>> code, machine_mode mode,
>> >  			    || (optab_handler (sdivmod_optab, int_mode)
>> >  				!= CODE_FOR_nothing)))
>> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
>> > -						int_mode, treeop0, treeop1,
>> > -						op0, gen_int_mode (abs_d,
>> > +						int_mode, op0,
>> > +						gen_int_mode (abs_d,
>> >  							      int_mode),
>> >  						NULL_RTX, 0);
>> >  		    else
>> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum tree_code
>> code, machine_mode mode,
>> >  				      size - 1, NULL_RTX, 0);
>> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>> >  				    NULL_RTX);
>> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
>> treeop0,
>> > -				    treeop1, t3, op1, NULL_RTX, 0);
>> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
>> op1,
>> > +				    NULL_RTX, 0);
>> >  		if (t4)
>> >  		  {
>> >  		    rtx t5;
>> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
>> >
>> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b
>> 2280
>> > c6e277f26d72 100644
>> > --- a/gcc/expr.cc
>> > +++ b/gcc/expr.cc
>> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>> >  	    return expand_divmod (0,
>> >  				  FLOAT_MODE_P (GET_MODE (value))
>> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
>> > -				  GET_MODE (value), NULL, NULL, op1, op2,
>> > -				  target, 0);
>> > +				  GET_MODE (value), op1, op2, target, 0);
>> >  	case MOD:
>> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> NULL, NULL,
>> > -				op1, op2, target, 0);
>> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> op1, op2,
>> > +				target, 0);
>> >  	case UDIV:
>> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> NULL, NULL,
>> > -				op1, op2, target, 1);
>> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> op1, op2,
>> > +				target, 1);
>> >  	case UMOD:
>> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> NULL, NULL,
>> > -				op1, op2, target, 1);
>> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> op1, op2,
>> > +				target, 1);
>> >  	case ASHIFTRT:
>> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
>> 9170,13 +9169,11 @@
>> > expand_expr_divmod (tree_code code, machine_mode mode, tree
>> treeop0,
>> >        bool speed_p = optimize_insn_for_speed_p ();
>> >        do_pending_stack_adjust ();
>> >        start_sequence ();
>> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> > -				   op0, op1, target, 1);
>> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> > + target, 1);
>> >        rtx_insn *uns_insns = get_insns ();
>> >        end_sequence ();
>> >        start_sequence ();
>> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> > -				   op0, op1, target, 0);
>> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> > + target, 0);
>> >        rtx_insn *sgn_insns = get_insns ();
>> >        end_sequence ();
>> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@ -9198,8
>> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
>> mode, tree treeop0,
>> >        emit_insn (sgn_insns);
>> >        return sgn_ret;
>> >      }
>> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> > -			op0, op1, target, unsignedp);
>> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
>> > + unsignedp);
>> >  }
>> >
>> >  rtx
>> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
>> >
>> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a
>> 3b
>> > 8a734baa800f 100644
>> > --- a/gcc/internal-fn.def
>> > +++ b/gcc/internal-fn.def
>> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
>> ECF_CONST
>> > | ECF_NOTHROW, first,
>> >
>> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
>> ECF_NOTHROW, first,
>> >  			      smul_highpart, umul_highpart, binary)
>> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
>> ECF_NOTHROW, first,
>> > +			      sadd_highpart, uadd_highpart, binary)
>> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
>> ECF_NOTHROW, first,
>> >  			      smulhs, umulhs, binary)
>> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
>> ECF_NOTHROW, first,
>> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >
>> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6
>> e
>> > 77082c1e617b 100644
>> > --- a/gcc/optabs.cc
>> > +++ b/gcc/optabs.cc
>> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
>> mode, rtx op0, rtx op1, bool unsignedp)
>> >  		return NULL_RTX;
>> >  	    }
>> >  	}
>> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
>> NULL, NULL,
>> > -				     sum, gen_int_mode (INTVAL (op1),
>> > -							word_mode),
>> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
>> sum,
>> > +				     gen_int_mode (INTVAL (op1),
>> word_mode),
>> >  				     NULL_RTX, 1, OPTAB_DIRECT);
>> >        if (remainder == NULL_RTX)
>> >  	return NULL_RTX;
>> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
>> mode, rtx
>> > op0, rtx op1, rtx *rem,
>> >
>> >    if (op11 != const1_rtx)
>> >      {
>> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL,
>> quot1,
>> > -				op11, NULL_RTX, unsignedp,
>> OPTAB_DIRECT);
>> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
>> op11,
>> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >        if (rem2 == NULL_RTX)
>> >  	return NULL_RTX;
>> >
>> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
>> mode, rtx op0, rtx op1, rtx *rem,
>> >        if (rem2 == NULL_RTX)
>> >  	return NULL_RTX;
>> >
>> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL,
>> quot1,
>> > -				 op11, NULL_RTX, unsignedp,
>> OPTAB_DIRECT);
>> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
>> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >        if (quot2 == NULL_RTX)
>> >  	return NULL_RTX;
>> >
>> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
>> >
>> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5
>> ccb
>> > f6147947351a 100644
>> > --- a/gcc/optabs.def
>> > +++ b/gcc/optabs.def
>> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>> >
>> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
>> > (umul_highpart_optab, "umul$a3_highpart")
>> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
>> > +(uadd_highpart_optab, "uadd$a3_highpart")
>> >
>> >  OPTAB_D (cmpmem_optab, "cmpmem$a")
>> >  OPTAB_D (cmpstr_optab, "cmpstr$a")
>> > diff --git a/gcc/target.def b/gcc/target.def index
>> >
>> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d
>> 81a
>> > fa2c2baa64a5 100644
>> > --- a/gcc/target.def
>> > +++ b/gcc/target.def
>> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>> >  	const vec_perm_indices &sel),
>> >   NULL)
>> >
>> > -DEFHOOK
>> > -(can_special_div_by_const,
>> > - "This hook is used to test whether the target has a special method
>> > of\n\ -division of vectors of type @var{vectype} using the value
>> > @var{constant},\n\ -and producing a vector of type @var{vectype}.  The
>> > division\n\ -will then not be decomposed by the vectorizer and kept as
>> > a div.\n\ -\n\ -When the hook is being used to test whether the target
>> > supports a special\n\ -divide, @var{in0}, @var{in1}, and @var{output}
>> > are all null.  When the hook\n\ -is being used to emit a division,
>> > @var{in0} and @var{in1} are the source\n\ -vectors of type
>> > @var{vecttype} and @var{output} is the destination vector of\n\ -type
>> > @var{vectype}.\n\ -\n\ -Return true if the operation is possible,
>> > emitting instructions for it\n\ -if rtxes are provided and updating
>> > @var{output}.",
>> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
>> > -	rtx in0, rtx in1),
>> > - default_can_special_div_by_const)
>> > -
>> >  /* Return true if the target supports misaligned store/load of a
>> >     specific factor denoted in the third parameter.  The last parameter
>> >     is true if the access is defined in a packed struct.  */ diff
>> > --git a/gcc/target.h b/gcc/target.h index
>> >
>> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b9
>> 9f9
>> > 13158c2d47b1 100644
>> > --- a/gcc/target.h
>> > +++ b/gcc/target.h
>> > @@ -51,7 +51,6 @@
>> >  #include "insn-codes.h"
>> >  #include "tm.h"
>> >  #include "hard-reg-set.h"
>> > -#include "tree-core.h"
>> >
>> >  #if CHECKING_P
>> >
>> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
>> >
>> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad224454
>> 93
>> > 17a31390f0c2 100644
>> > --- a/gcc/targhooks.h
>> > +++ b/gcc/targhooks.h
>> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
>> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
>> > (rtx, tree, tree);  extern unsigned int default_case_values_threshold
>> > (void);  extern bool default_have_conditional_execution (void);
>> > -extern bool default_can_special_div_by_const (enum tree_code, tree,
>> wide_int,
>> > -					      rtx *, rtx, rtx);
>> >
>> >  extern bool default_libc_has_function (enum function_class, tree);
>> > extern bool default_libc_has_fast_function (int fcode); diff --git
>> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
>> >
>> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e
>> 03
>> > 877337a931e7 100644
>> > --- a/gcc/targhooks.cc
>> > +++ b/gcc/targhooks.cc
>> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>> >    return HAVE_conditional_execution;
>> >  }
>> >
>> > -/* Default that no division by constant operations are special.  */
>> > -bool -default_can_special_div_by_const (enum tree_code, tree,
>> > wide_int, rtx *, rtx,
>> > -				  rtx)
>> > -{
>> > -  return false;
>> > -}
>> > -
>> >  /* By default we assume that c99 functions are present at the runtime,
>> >     but sincos is not.  */
>> >  bool
>> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> > new file mode 100644
>> > index
>> >
>> 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0
>> a0
>> > 4ea8c1f73e3c
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> > @@ -0,0 +1,25 @@
>> > +/* { dg-require-effective-target vect_int } */
>> > +
>> > +#include <stdint.h>
>> > +#include "tree-vect.h"
>> > +
>> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
>> > +
>> > +static __attribute__((__noinline__)) __attribute__((__noclone__)) V
>> > +foo (V v, unsigned short i) {
>> > +  v /= i;
>> > +  return v;
>> > +}
>> > +
>> > +int
>> > +main (void)
>> > +{
>> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff },
>> > +0xffff);
>> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
>> > +    if (v[i] != 0x00010001)
>> > +      __builtin_abort ();
>> > +  return 0;
>> > +}
>> > +
>> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
>> > +detected" "vect" { target aarch64*-*-* } } } */
>> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> > new file mode 100644
>> > index
>> >
>> 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b4991
>> 4d2
>> > a29b933de625
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> > @@ -0,0 +1,58 @@
>> > +/* { dg-require-effective-target vect_int } */
>> > +
>> > +#include <stdint.h>
>> > +#include <stdio.h>
>> > +#include "tree-vect.h"
>> > +
>> > +#define N 50
>> > +#define TYPE uint8_t
>> > +
>> > +#ifndef DEBUG
>> > +#define DEBUG 0
>> > +#endif
>> > +
>> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
>> > +
>> > +
>> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
>> > +restrict pixel, TYPE level, int n) {
>> > +  for (int i = 0; i < n; i+=1)
>> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> > +
>> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
>> > +restrict pixel, TYPE level, int n) {
>> > +  for (int i = 0; i < n; i+=1)
>> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> > +
>> > +int main ()
>> > +{
>> > +  TYPE a[N];
>> > +  TYPE b[N];
>> > +
>> > +  for (int i = 0; i < N; ++i)
>> > +    {
>> > +      a[i] = BASE + i * 13;
>> > +      b[i] = BASE + i * 13;
>> > +      if (DEBUG)
>> > +        printf ("%d: 0x%x\n", i, a[i]);
>> > +    }
>> > +
>> > +  fun1 (a, N / 2, N);
>> > +  fun2 (b, N / 2, N);
>> > +
>> > +  for (int i = 0; i < N; ++i)
>> > +    {
>> > +      if (DEBUG)
>> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
>> > +
>> > +      if (a[i] != b[i])
>> > +        __builtin_abort ();
>> > +    }
>> > +  return 0;
>> > +}
>> > +
>> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" {
>> > +target aarch64*-*-* } } } */
>> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc index
>> >
>> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077d
>> c3
>> > e970bed75ef6 100644
>> > --- a/gcc/tree-vect-generic.cc
>> > +++ b/gcc/tree-vect-generic.cc
>> > @@ -1237,17 +1237,6 @@ expand_vector_operation
>> (gimple_stmt_iterator *gsi, tree type, tree compute_type
>> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
>> >  	  tree ret;
>> >
>> > -	  /* Check if the target was going to handle it through the special
>> > -	     division callback hook.  */
>> > -	  tree cst = uniform_integer_cst_p (rhs2);
>> > -	  if (cst &&
>> > -	      targetm.vectorize.can_special_div_by_const (code, type,
>> > -							  wi::to_wide (cst),
>> > -							  NULL,
>> > -							  NULL_RTX,
>> NULL_RTX))
>> > -	    return NULL_TREE;
>> > -
>> > -
>> >  	  if (!optimize
>> >  	      || !VECTOR_INTEGER_TYPE_P (type)
>> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
>> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> 69
>> > de2afea139d6 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >        return pattern_stmt;
>> >      }
>> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> vectype,
>> > -							  wi::to_wide (cst),
>> > -							  NULL, NULL_RTX,
>> > -							  NULL_RTX))
>> > +	   && TYPE_UNSIGNED (itype)
>> > +	   && rhs_code == TRUNC_DIV_EXPR
>> > +	   && vectype
>> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> > +					      OPTIMIZE_FOR_SPEED))
>> >      {
>> > -      return NULL;
>> > +      /* div optimizations using narrowings
>> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> > +       double the precision of x.
>> > +
>> > +       If we imagine a short as being composed of two blocks of bytes then
>> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> > +       adding 1 to each sub component:
>> > +
>> > +	    short value of 16-bits
>> > +       ┌──────────────┬────────────────┐
>> > +       │              │                │
>> > +       └──────────────┴────────────────┘
>> > +	 8-bit part1 ▲  8-bit part2   ▲
>> > +		     │                │
>> > +		     │                │
>> > +		    +1               +1
>> > +
>> > +       after the first addition, we have to shift right by 8, and narrow the
>> > +       results back to a byte.  Remember that the addition must be done in
>> > +       double the precision of the input.  However if we know that the
>> addition
>> > +       `x + 257` does not overflow then we can do the operation in the
>> current
>> > +       precision.  In which case we don't need the pack and unpacks.  */
>> > +      auto wcst = wi::to_wide (cst);
>> > +      int pow = wi::exact_log2 (wcst + 1);
>> > +      if (pow == (int) (element_precision (vectype) / 2))
>> > +	{
>> > +	  wide_int min,max;
>> > +	  /* If we're in a pattern we need to find the orginal definition.  */
>> > +	  tree op0 = oprnd0;
>> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> > +	  if (is_pattern_stmt_p (stmt_info))
>> > +	    {
>> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> > +	    }
>> 
>> If this is generally safe (I'm skipping thinking about it in the interests of a
>> quick review :-)), then I think it should be done in vect_get_range_info
>> instead.  Using gimple_get_lhs would be more general than handling just
>> assignments.
>> 
>> > +
>> > +	  /* Check that no overflow will occur.  If we don't have range
>> > +	     information we can't perform the optimization.  */
>> > +	  if (vect_get_range_info (op0, &min, &max))
>> > +	    {
>> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> > +	      wi::overflow_type ovf;
>> > +	      /* We need adder and max in the same precision.  */
>> > +	      wide_int zadder
>> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> > +					  UNSIGNED);
>> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> 
>> Could you explain this a bit more?  When do we have mismatched
>> precisions?
>
> C promotion rules will promote e.g.
>
> void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> {
>   for (int i = 0; i < n; i+=1)
>     pixel[i] = (pixel[i] + level) / 0xff;
> }
>
> And have the addition be done as a 32 bit integer.  The vectorizer will demote this down
> to a short, but range information is not stored for patterns.  So In the above the range will
> correctly be 0x1fe but the precision will be that of the original expression, so 32.  This will
> be a mismatch with itype which is derived from the size the vectorizer will perform the
> operation in.
>
> Thanks,
> Tamar
>
>> 
>> Thanks,
>> Richard
>> 
>> > +	      if (ovf == wi::OVF_NONE)
>> > +		{
>> > +		  *type_out = vectype;
>> > +		  tree tadder = wide_int_to_tree (itype, adder);
>> > +		  gcall *patt1
>> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
>> tadder);
>> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
>> > +		  gimple_call_set_lhs (patt1, lhs);
>> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
>> vectype);
>> > +
>> > +		  pattern_stmt
>> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
>> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
>> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
>> > +
>> > +		  return pattern_stmt;
>> > +		}
>> > +	    }
>> > +	}
>> >      }
>> >
>> >    if (prec > HOST_BITS_PER_WIDE_INT
>> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
>> >
>> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b95
>> 64f
>> > c4e066e50081 100644
>> > --- a/gcc/tree-vect-stmts.cc
>> > +++ b/gcc/tree-vect-stmts.cc
>> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>> >  	}
>> >        target_support_p = (optab_handler (optab, vec_mode)
>> >  			  != CODE_FOR_nothing);
>> > -      tree cst;
>> > -      if (!target_support_p
>> > -	  && op1
>> > -	  && (cst = uniform_integer_cst_p (op1)))
>> > -	target_support_p
>> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
>> > -							wi::to_wide (cst),
>> > -							NULL, NULL_RTX,
>> > -							NULL_RTX);
>> >      }
>> >
>> >    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
Tamar Christina Feb. 10, 2023, 2:54 p.m. UTC | #7
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 2:31 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Friday, February 10, 2023 1:36 PM
> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> >> rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> I think I'm misunderstanding, but: it seems like we're treating the
> >> add highpart optabs as companions to the mul highpart optabs.  But
> >> AIUI, the add highpart optab is used such that, for an N-bit mode, we
> >> do an N-bit addition followed by a shift by N/2.  Is that right?
> >> The mul highpart optabs instead do a 2N-bit multiplication followed
> >> by a shift by N.
> >
> > Correct.
> >
> >>
> >> Apart from consistency, the reason this matters is: I'm not sure what
> >> we gain by adding the optab rather than simply open-coding the
> >> addition and the shift directly into the vector pattern.  It seems
> >> like the AArch64 expander in
> >> 2/2 does just do an ordinary N-bit addition followed by an ordinary
> >> shift by N/2.
> >
> > I mentioned this in the implementation: I did so because AArch64 has
> > various optimizations on shifts when it comes to truncating results.  I
> > didn't need to represent it with shifts; in fact the original pattern
> > did not.  But representing it directly in the final instructions is
> > problematic because those instructions are unspecs, and I would have
> > needed to provide additional optabs to optimize them.
> >
> > So the shift representation was more natural for AArch64.  It would not
> > be, say, for AArch32, which does not already have these optimizations.
> > SVE has similar optimizations, and at the very worst you get a usra.
> >
> > I avoided open coding it with add and shift because it creates a
> > four-instruction dependency chain (with shifts, which are typically
> > slow) instead of a load and multiply.  This change, unless the target is
> > known to optimize it further, is unlikely to be beneficial.  And by the
> > time we get to costing, the only alternative is to undo the existing
> > pattern, so you lose the general shift optimization.
> >
> > So it seemed unwise to open code it as shifts, given that the codegen
> > out of the vectorizer would be degenerate for most targets, or one would
> > need the more complicated route of costing during pattern matching.
> 
> Hmm, OK.  That seems like a cost-model thing though, rather than something
> that should be exposed through optabs.  And I imagine the open-coded
> version would still be better than nothing on targets without highpart
> multiply.

Yeah, but I don't think we've ever done costing on patterns during matching.
It's always been commit and go, under the assumption that the replacement
is always going to be cheaper.

> 
> So how about replacing the hook with one that simply asks whether division
> through highpart multiplication is preferred over the add/shift sequence?
> (Unfortunately it's not going to be possible to work that out from existing
> information.)

If Richi has no objections to it, I can do that instead then.

Just to clarify, are you satisfied with the answer on the mixed precisions?
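
For reference, the mixed-precision situation on the fun2 example looks
roughly like this (an illustrative sketch, not an actual vect dump):

  _1 = (int) pixel[i];   /* C promotion: the addition is a 32-bit op.  */
  _2 = (int) level;
  _3 = _1 + _2;          /* recorded range [0, 0x1fe], precision 32.  */
  _4 = _3 / 255;
  pixel[i] = (uint8_t) _4;

The pattern demotes the arithmetic to the 16-bit itype, so the adder
(0x101) has precision 16 while max keeps precision 32; that is why
wide_int_storage::from widens the adder before the overflow check.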

Thanks,
Tamar

> 
> Thanks,
> Richard
> 
> >
> >>
> >> Some comments in addition to Richard's:
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Hi All,
> >> >
> >> > As discussed in the ticket, this replaces the approach for
> >> > optimizing the div by bitmask operation from a hook into optabs
> >> > implemented through add_highpart.
> >> >
> >> > In order to be able to use this we need to check whether the
> >> > current precision has enough bits to do the operation without any
> >> > of the additions
> >> overflowing.
> >> >
> >> > We use range information to determine this and only do the
> >> > operation if we're sure am overflow won't occur.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> >> issues.
> >> >
> >> > Ok for master?
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> >> Remove.
> >> > 	* doc/tm.texi.in: Likewise.
> >> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
> >> patch.
> >> > 	* expmed.cc (expand_divmod): Likewise.
> >> > 	* expmed.h (expand_divmod): Likewise.
> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> >> > 	* optabs.cc (expand_doubleword_mod,
> >> expand_doubleword_divmod): Likewise.
> >> > 	* internal-fn.def (ADDH): New.
> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> >> > 	* doc/md.texi: Document them.
> >> > 	* doc/rtl.texi: Likewise.
> >> > 	* target.def (can_special_div_by_const): Remove.
> >> > 	* target.h: Remove tree-core.h include
> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
> >> and
> >> > 	implement new obtab recognition based on range.
> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >> >
> >> > --- inline copy of patch --
> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >> >
> >>
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f74080
> >> 3
> >> > 8595e21af35d 100644
> >> > --- a/gcc/doc/md.texi
> >> > +++ b/gcc/doc/md.texi
> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> >> @code{smul_highpart} RTX expression.
> >> >  Similar, but the multiplication is unsigned.  This may be
> >> > represented in RTL using an @code{umul_highpart} RTX expression.
> >> >
> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{smul@var{m}3_highpart}
> >>
> >> sadd
> >>
> >> > +Perform a signed addition of operands 1 and 2, which have mode
> >> > +@var{m}, and store the most significant half of the product in operand
> 0.
> >> > +The least significant half of the product is discarded.  This may
> >> > +be represented in RTL using a @code{sadd_highpart} RTX expression.
> >> > +
> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
> >> > +This may be represented in RTL using an @code{uadd_highpart} RTX
> >> > +expression.
> >> > +
> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
> extend
> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
> >> > b/gcc/doc/rtl.texi index
> >> >
> >>
> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343
> >> d17
> >> > 1940ec4222f3 100644
> >> > --- a/gcc/doc/rtl.texi
> >> > +++ b/gcc/doc/rtl.texi
> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> >> > @code{smul_highpart} returns the high part  of a signed
> >> > multiplication, @code{umul_highpart} returns the high part  of an
> >> > unsigned
> >> multiplication.
> >> >
> >> > +@findex sadd_highpart
> >> > +@findex uadd_highpart
> >> > +@cindex high-part addition
> >> > +@cindex addition high part
> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
> >> > +addition of @var{x} and @var{y} carried out in machine mode
> @var{m}.
> >> > +@code{sadd_highpart} returns the high part of a signed addition,
> >> > +@code{uadd_highpart} returns the high part of an unsigned addition.
> >>
> >> The patch doesn't add these RTL codes though.
> >>
> >> > +
> >> >  @findex fma
> >> >  @cindex fused multiply-add
> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >> >
> >>
> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57914840
> >> 17e
> >> > 6b0d62ab077e 100644
> >> > --- a/gcc/doc/tm.texi
> >> > +++ b/gcc/doc/tm.texi
> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for
> >> > the hook to handle these two  implementation approaches itself.
> >> >  @end deftypefn
> >> >
> >> > -@deftypefn {Target Hook} bool
> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> >> @var{tree_code}, tree
> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> >> > target has a special method of -division of vectors of type
> >> > @var{vectype}
> >> using the value @var{constant}, -and producing a vector of type
> >> @var{vectype}.  The division -will then not be decomposed by the
> >> vectorizer and kept as a div.
> >> > -
> >> > -When the hook is being used to test whether the target supports a
> >> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> > When the hook -is being used to emit a division, @var{in0} and
> >> > @var{in1} are the source -vectors of type @var{vecttype} and
> >> > @var{output} is the destination vector of -type @var{vectype}.
> >> > -
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it -if rtxes are provided and updating @var{output}.
> >> > -@end deftypefn
> >> > -
> >> >  @deftypefn {Target Hook} tree
> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> >> @var{code},
> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
> >> > return the decl of a function that implements the  vectorized
> >> > variant of the function with the @code{combined_fn} code diff --git
> >> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> >> >
> >>
> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0
> >> a3a
> >> > bccd1c293c7b 100644
> >> > --- a/gcc/doc/tm.texi.in
> >> > +++ b/gcc/doc/tm.texi.in
> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> >> strategy can generate better code.
> >> >
> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >> >
> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> >> > -
> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >> >
> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >> >
> >>
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0
> >> bef
> >> > a016eea4573c 100644
> >> > --- a/gcc/explow.cc
> >> > +++ b/gcc/explow.cc
> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >> >       TRUNC_DIV_EXPR.  */
> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > size, align_rtx,
> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >> >  			NULL_RTX, 1);
> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >> >
> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
> >> required_align)
> >> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >> >  				       Pmode),
> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > target,
> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >> >  					Pmode),
> >> >  			  NULL_RTX, 1);
> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >> >
> >>
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c5364
> >> 094
> >> > 1628068f3901 100644
> >> > --- a/gcc/expmed.h
> >> > +++ b/gcc/expmed.h
> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
> >> (enum tree_code, machine_mode, rtx, int, rtx,
> >> >  			       int);
> >> >  #ifdef GCC_OPTABS_H
> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode,
> tree,
> >> tree,
> >> > -			  rtx, rtx, rtx, int,
> >> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
> >> rtx,
> >> > +			  rtx, int, enum optab_methods =
> >> OPTAB_LIB_WIDEN);
> >> >  #endif
> >> >  #endif
> >> >
> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >> >
> >>
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3
> >> a59
> >> > c169d3b7692f 100644
> >> > --- a/gcc/expmed.cc
> >> > +++ b/gcc/expmed.cc
> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode,
> rtx
> >> op0,
> >> > HOST_WIDE_INT d)
> >> >
> >> >  rtx
> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> >> mode,
> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> >> > -	       int unsignedp, enum optab_methods methods)
> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> >> > +	       enum optab_methods methods)
> >> >  {
> >> >    machine_mode compute_mode;
> >> >    rtx tquotient;
> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> > code, machine_mode mode,
> >> >
> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
> >> > 0;
> >> >
> >> > -  /* Check if the target has specific expansions for the division.
> >> > */
> >> > -  tree cst;
> >> > -  if (treeop0
> >> > -      && treeop1
> >> > -      && (cst = uniform_integer_cst_p (treeop1))
> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> >> (treeop0),
> >> > -						     wi::to_wide (cst),
> >> > -						     &target, op0, op1))
> >> > -    return target;
> >> > -
> >> > -
> >> >    /* Now convert to the best mode to use.  */
> >> >    if (compute_mode != mode)
> >> >      {
> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >> >  				!= CODE_FOR_nothing)))
> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> >> > -						int_mode, treeop0, treeop1,
> >> > -						op0, gen_int_mode (abs_d,
> >> > +						int_mode, op0,
> >> > +						gen_int_mode (abs_d,
> >> >  							      int_mode),
> >> >  						NULL_RTX, 0);
> >> >  		    else
> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  				      size - 1, NULL_RTX, 0);
> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >> >  				    NULL_RTX);
> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> >> treeop0,
> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
> >> op1,
> >> > +				    NULL_RTX, 0);
> >> >  		if (t4)
> >> >  		  {
> >> >  		    rtx t5;
> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >> >
> >>
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b
> >> 2280
> >> > c6e277f26d72 100644
> >> > --- a/gcc/expr.cc
> >> > +++ b/gcc/expr.cc
> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >> >  	    return expand_divmod (0,
> >> >  				  FLOAT_MODE_P (GET_MODE (value))
> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> >> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> >> > -				  target, 0);
> >> > +				  GET_MODE (value), op1, op2, target, 0);
> >> >  	case MOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 0);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 0);
> >> >  	case UDIV:
> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case UMOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case ASHIFTRT:
> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
> >> 9170,13 +9169,11 @@
> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> >> treeop0,
> >> >        bool speed_p = optimize_insn_for_speed_p ();
> >> >        do_pending_stack_adjust ();
> >> >        start_sequence ();
> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> > -				   op0, op1, target, 1);
> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 1);
> >> >        rtx_insn *uns_insns = get_insns ();
> >> >        end_sequence ();
> >> >        start_sequence ();
> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> > -				   op0, op1, target, 0);
> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 0);
> >> >        rtx_insn *sgn_insns = get_insns ();
> >> >        end_sequence ();
> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
> >> > -9198,8
> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> >> mode, tree treeop0,
> >> >        emit_insn (sgn_insns);
> >> >        return sgn_ret;
> >> >      }
> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -			op0, op1, target, unsignedp);
> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> >> > + unsignedp);
> >> >  }
> >> >
> >> >  rtx
> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >> >
> >>
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a
> >> 3b
> >> > 8a734baa800f 100644
> >> > --- a/gcc/internal-fn.def
> >> > +++ b/gcc/internal-fn.def
> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
> >> ECF_CONST
> >> > | ECF_NOTHROW, first,
> >> >
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smul_highpart, umul_highpart, binary)
> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > +			      sadd_highpart, uadd_highpart, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smulhs, umulhs, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >
> >>
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6
> >> e
> >> > 77082c1e617b 100644
> >> > --- a/gcc/optabs.cc
> >> > +++ b/gcc/optabs.cc
> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> >> mode, rtx op0, rtx op1, bool unsignedp)
> >> >  		return NULL_RTX;
> >> >  	    }
> >> >  	}
> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> NULL, NULL,
> >> > -				     sum, gen_int_mode (INTVAL (op1),
> >> > -							word_mode),
> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> sum,
> >> > +				     gen_int_mode (INTVAL (op1),
> >> word_mode),
> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >> >        if (remainder == NULL_RTX)
> >> >  	return NULL_RTX;
> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod
> (machine_mode
> >> mode, rtx
> >> > op0, rtx op1, rtx *rem,
> >> >
> >> >    if (op11 != const1_rtx)
> >> >      {
> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> >> op11,
> >> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod
> (machine_mode
> >> mode, rtx op0, rtx op1, rtx *rem,
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				 op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1,
> op11,
> >> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (quot2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >> >
> >>
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5
> >> ccb
> >> > f6147947351a 100644
> >> > --- a/gcc/optabs.def
> >> > +++ b/gcc/optabs.def
> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >> >
> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> >> > (umul_highpart_optab, "umul$a3_highpart")
> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >> >
> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
> >> >
> >>
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d
> >> 81a
> >> > fa2c2baa64a5 100644
> >> > --- a/gcc/target.def
> >> > +++ b/gcc/target.def
> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >> >  	const vec_perm_indices &sel),
> >> >   NULL)
> >> >
> >> > -DEFHOOK
> >> > -(can_special_div_by_const,
> >> > - "This hook is used to test whether the target has a special
> >> > method of\n\ -division of vectors of type @var{vectype} using the
> >> > value @var{constant},\n\ -and producing a vector of type
> >> > @var{vectype}.  The division\n\ -will then not be decomposed by the
> >> > vectorizer and kept as a div.\n\ -\n\ -When the hook is being used
> >> > to test whether the target supports a special\n\ -divide,
> >> > @var{in0}, @var{in1}, and @var{output} are all null.  When the
> >> > hook\n\ -is being used to emit a division, @var{in0} and @var{in1}
> >> > are the source\n\ -vectors of type @var{vecttype} and @var{output}
> >> > is the destination vector of\n\ -type @var{vectype}.\n\ -\n\
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it\n\ -if rtxes are provided and updating @var{output}.",
> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> >> > -	rtx in0, rtx in1),
> >> > - default_can_special_div_by_const)
> >> > -
> >> >  /* Return true if the target supports misaligned store/load of a
> >> >     specific factor denoted in the third parameter.  The last parameter
> >> >     is true if the access is defined in a packed struct.  */ diff
> >> > --git a/gcc/target.h b/gcc/target.h index
> >> >
> >>
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b9
> >> 9f9
> >> > 13158c2d47b1 100644
> >> > --- a/gcc/target.h
> >> > +++ b/gcc/target.h
> >> > @@ -51,7 +51,6 @@
> >> >  #include "insn-codes.h"
> >> >  #include "tm.h"
> >> >  #include "hard-reg-set.h"
> >> > -#include "tree-core.h"
> >> >
> >> >  #if CHECKING_P
> >> >
> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >> >
> >>
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad224454
> >> 93
> >> > 17a31390f0c2 100644
> >> > --- a/gcc/targhooks.h
> >> > +++ b/gcc/targhooks.h
> >> > @@ -209,8 +209,6 @@ extern void
> default_addr_space_diagnose_usage
> >> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
> >> > (rtx, tree, tree);  extern unsigned int
> >> > default_case_values_threshold (void);  extern bool
> >> > default_have_conditional_execution (void); -extern bool
> >> > default_can_special_div_by_const (enum tree_code, tree,
> >> wide_int,
> >> > -					      rtx *, rtx, rtx);
> >> >
> >> >  extern bool default_libc_has_function (enum function_class, tree);
> >> > extern bool default_libc_has_fast_function (int fcode); diff --git
> >> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >> >
> >>
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e
> >> 03
> >> > 877337a931e7 100644
> >> > --- a/gcc/targhooks.cc
> >> > +++ b/gcc/targhooks.cc
> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >> >    return HAVE_conditional_execution;  }
> >> >
> >> > -/* Default that no division by constant operations are special.
> >> > */ -bool -default_can_special_div_by_const (enum tree_code, tree,
> >> > wide_int, rtx *, rtx,
> >> > -				  rtx)
> >> > -{
> >> > -  return false;
> >> > -}
> >> > -
> >> >  /* By default we assume that c99 functions are present at the runtime,
> >> >     but sincos is not.  */
> >> >  bool
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0
> >> a0
> >> > 4ea8c1f73e3c
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > @@ -0,0 +1,25 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> >> > +
> >> > +static __attribute__((__noinline__)) __attribute__((__noclone__))
> >> > +V foo (V v, unsigned short i) {
> >> > +  v /= i;
> >> > +  return v;
> >> > +}
> >> > +
> >> > +int
> >> > +main (void)
> >> > +{
> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
> >> > +}, 0xffff);
> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> >> > +    if (v[i] != 0x00010001)
> >> > +      __builtin_abort ();
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> >> > +detected" "vect" { target aarch64*-*-* } } } */
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b4991
> >> 4d2
> >> > a29b933de625
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > @@ -0,0 +1,58 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include <stdio.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +#define N 50
> >> > +#define TYPE uint8_t
> >> > +
> >> > +#ifndef DEBUG
> >> > +#define DEBUG 0
> >> > +#endif
> >> > +
> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> >> > +
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O1")))
> >> > +void fun1(TYPE* restrict pixel, TYPE level, int n)
> >> > +{
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff;
> >> > +}
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O3")))
> >> > +void fun2(TYPE* restrict pixel, TYPE level, int n)
> >> > +{
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff;
> >> > +}
> >> > +
> >> > +int main ()
> >> > +{
> >> > +  TYPE a[N];
> >> > +  TYPE b[N];
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      a[i] = BASE + i * 13;
> >> > +      b[i] = BASE + i * 13;
> >> > +      if (DEBUG)
> >> > +        printf ("%d: 0x%x\n", i, a[i]);
> >> > +    }
> >> > +
> >> > +  fun1 (a, N / 2, N);
> >> > +  fun2 (b, N / 2, N);
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      if (DEBUG)
> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> >> > +
> >> > +      if (a[i] != b[i])
> >> > +        __builtin_abort ();
> >> > +    }
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
> >> > +{ target aarch64*-*-* } } } */
> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> >> > index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
> >> > --- a/gcc/tree-vect-generic.cc
> >> > +++ b/gcc/tree-vect-generic.cc
> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >> >  	  tree ret;
> >> >
> >> > -	  /* Check if the target was going to handle it through the special
> >> > -	     division callback hook.  */
> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
> >> > -	  if (cst &&
> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL,
> >> > -							  NULL_RTX, NULL_RTX))
> >> > -	    return NULL_TREE;
> >> > -
> >> > -
> >> >  	  if (!optimize
> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST
> >> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> >> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL, NULL_RTX,
> >> > -							  NULL_RTX))
> >> > +	   && TYPE_UNSIGNED (itype)
> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> > +	   && vectype
> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* div optimizations using narrowings
> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +	    short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> > +		     │                │
> >> > +		     │                │
> >> > +		    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done in
> >> > +       double the precision of the input.  However if we know that the addition
> >> > +       `x + 257` does not overflow then we can do the operation in the current
> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +	{
> >> > +	  wide_int min,max;
> >> > +	  /* If we're in a pattern we need to find the original definition.  */
> >> > +	  tree op0 = oprnd0;
> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> > +	    {
> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +	    }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> > +	     information we can't perform the optimization.  */
> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> > +	    {
> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +	      wi::overflow_type ovf;
> >> > +	      /* We need adder and max in the same precision.  */
> >> > +	      wide_int zadder
> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> >> > +					  UNSIGNED);
> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> > {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff;
> > }
> >
> > And have the addition be done as a 32 bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So in the above the range will correctly be 0x1fe but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype, which is derived from the size the
> > vectorizer will perform the operation in.
> >
> > Thanks,
> > Tamar
> >
> >>
> >> Thanks,
> >> Richard
> >>
> >> > +	      if (ovf == wi::OVF_NONE)
> >> > +		{
> >> > +		  *type_out = vectype;
> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
> >> > +		  gcall *patt1
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (patt1, lhs);
> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
> >> > +
> >> > +		  pattern_stmt
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> >> > +
> >> > +		  return pattern_stmt;
> >> > +		}
> >> > +	    }
> >> > +	}
> >> >      }
> >> >
> >> >    if (prec > HOST_BITS_PER_WIDE_INT
> >> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >> > index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> >> > --- a/gcc/tree-vect-stmts.cc
> >> > +++ b/gcc/tree-vect-stmts.cc
> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >> >  	}
> >> >        target_support_p = (optab_handler (optab, vec_mode)
> >> >  			  != CODE_FOR_nothing);
> >> > -      tree cst;
> >> > -      if (!target_support_p
> >> > -	  && op1
> >> > -	  && (cst = uniform_integer_cst_p (op1)))
> >> > -	target_support_p
> >> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> >> > -							wi::to_wide (cst),
> >> > -							NULL, NULL_RTX,
> >> > -							NULL_RTX);
> >> >      }
> >> >
> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);
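
For readers sanity-checking the transformation above: the identity the new
comment relies on can be verified exhaustively with a minimal standalone C
program (illustrative only; it is not part of the patch or its testsuite).
Computing in uint32_t guarantees that `x + 257` cannot overflow, which is the
same side condition the patch's range check establishes before using the
narrower type:

  #include <stdint.h>
  #include <stdlib.h>

  int
  main (void)
  {
    for (uint32_t x = 0; x <= 0xffff; x++)
      {
        /* Wide enough that x + 257 cannot wrap for 16-bit x.  */
        uint32_t q = (x + ((x + 257) >> 8)) >> 8;
        if (q != x / 0xff)
          abort ();
      }
    return 0;
  }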
Richard Sandiford Feb. 10, 2023, 3:56 p.m. UTC | #8
Tamar Christina <Tamar.Christina@arm.com> writes:
>> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >        return pattern_stmt;
>> >      }
>> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
>> > -							  wi::to_wide (cst),
>> > -							  NULL, NULL_RTX,
>> > -							  NULL_RTX))
>> > +	   && TYPE_UNSIGNED (itype)
>> > +	   && rhs_code == TRUNC_DIV_EXPR
>> > +	   && vectype
>> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> > +					      OPTIMIZE_FOR_SPEED))
>> >      {
>> > -      return NULL;
>> > +      /* div optimizations using narrowings
>> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> > +       double the precision of x.
>> > +
>> > +       If we imagine a short as being composed of two blocks of bytes then
>> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> > +       adding 1 to each sub component:
>> > +
>> > +	    short value of 16-bits
>> > +       ┌──────────────┬────────────────┐
>> > +       │              │                │
>> > +       └──────────────┴────────────────┘
>> > +	 8-bit part1 ▲  8-bit part2   ▲
>> > +		     │                │
>> > +		     │                │
>> > +		    +1               +1
>> > +
>> > +       after the first addition, we have to shift right by 8, and narrow the
>> > +       results back to a byte.  Remember that the addition must be done in
>> > +       double the precision of the input.  However if we know that the addition
>> > +       `x + 257` does not overflow then we can do the operation in the current
>> > +       precision.  In which case we don't need the pack and unpacks.  */
>> > +      auto wcst = wi::to_wide (cst);
>> > +      int pow = wi::exact_log2 (wcst + 1);
>> > +      if (pow == (int) (element_precision (vectype) / 2))
>> > +	{
>> > +	  wide_int min,max;
>> > +	  /* If we're in a pattern we need to find the original definition.  */
>> > +	  tree op0 = oprnd0;
>> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> > +	  if (is_pattern_stmt_p (stmt_info))
>> > +	    {
>> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> > +	    }
>> 
>> If this is generally safe (I'm skipping thinking about it in the interests of a
>> quick review :-)), then I think it should be done in vect_get_range_info
>> instead.  Using gimple_get_lhs would be more general than handling just
>> assignments.
>> 
>> > +
>> > +	  /* Check that no overflow will occur.  If we don't have range
>> > +	     information we can't perform the optimization.  */
>> > +	  if (vect_get_range_info (op0, &min, &max))
>> > +	    {
>> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> > +	      wi::overflow_type ovf;
>> > +	      /* We need adder and max in the same precision.  */
>> > +	      wide_int zadder
>> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> > +					  UNSIGNED);
>> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> 
>> Could you explain this a bit more?  When do we have mismatched
>> precisions?
>
> C promotion rules will promote e.g.
>
> void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> {
>   for (int i = 0; i < n; i+=1)
>     pixel[i] = (pixel[i] + level) / 0xff;
> }
>
> And have the addition be done as a 32 bit integer.  The vectorizer will demote this down
> to a short, but range information is not stored for patterns.  So in the above the range will
> correctly be 0x1fe but the precision will be that of the original expression, so 32.  This will
> be a mismatch with itype which is derived from the size the vectorizer will perform the
> operation in.

Gah, missed this first time round, sorry.

Richi would know better than me, but I think it's dangerous to rely on
the orig/pattern link for range information.  The end result of a pattern
(vect_stmt_to_vectorize) has to have the same type as the lhs of the
original statement.  But the other statements in the pattern sequence
can do arbitrary things.  Their range isn't predictable from the range
of the original statement result.

IIRC, the addition above is converted to:

  a' = (uint16_t) pixel[i]
  b' = (uint16_t) level
  sum' = a' + b'
  sum = (int) sum'

where sum is the direct replacement of "pixel[i] + level", with the
same type and range.  The division then uses sum' instead of sum.

But the fact that sum' is part of the same pattern as sum doesn't
guarantee that sum' has the same range as sum.  E.g. the pattern
statements added by the division optimisation wouldn't have this
property.

Is it possible to tell ranger to compute the range of expressions that
haven't been added to the IL?  (Genuine question, haven't looked.
It seems pretty powerful though.)

Thanks,
Richard
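
As a scalar illustration of the narrowing being discussed (the function names
are invented for the example; this is not vectorizer output), the two versions
below compute the same value for every input, because the promoted sum never
exceeds 0xff + 0xff = 0x1fe and therefore fits in 16 bits:

  #include <stdint.h>

  /* What the C source says: the addition is promoted to int.  */
  uint8_t div_int (uint8_t pixel, uint8_t level)
  {
    return (pixel + level) / 0xff;   /* 32-bit sum, range [0, 0x1fe].  */
  }

  /* What the over-widening pattern narrows it to.  */
  uint8_t div_u16 (uint8_t pixel, uint8_t level)
  {
    uint16_t a = pixel, b = level;   /* a' and b' in the sequence above.  */
    uint16_t sum = a + b;            /* sum': same value, range [0, 0x1fe].  */
    return sum / 0xff;
  }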
Tamar Christina Feb. 10, 2023, 4:09 p.m. UTC | #9
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 3:57 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> 69
> >> > de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> vectype,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL, NULL_RTX,
> >> > -							  NULL_RTX))
> >> > +	   && TYPE_UNSIGNED (itype)
> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> > +	   && vectype
> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* div optimizations using narrowings
> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes
> then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent
> to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +	    short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> > +		     │                │
> >> > +		     │                │
> >> > +		    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done
> in
> >> > +       double the precision of the input.  However if we know that
> >> > + the
> >> addition
> >> > +       `x + 257` does not overflow then we can do the operation in
> >> > + the
> >> current
> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +	{
> >> > +	  wide_int min,max;
> >> > +	  /* If we're in a pattern we need to find the original definition.  */
> >> > +	  tree op0 = oprnd0;
> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> > +	    {
> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +	    }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> > +	     information we can't perform the optimization.  */
> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> > +	    {
> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +	      wi::overflow_type ovf;
> >> > +	      /* We need adder and max in the same precision.  */
> >> > +	      wide_int zadder
> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> >> > +					  UNSIGNED);
> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >
> > And have the addition be done as a 32 bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So in the above the range will correctly be 0x1fe but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype which is derived from the size the vectorizer
> will perform the operation in.
> 
> Gah, missed this first time round, sorry.
> 
> Richi would know better than me, but I think it's dangerous to rely on the
> orig/pattern link for range information.  The end result of a pattern
> (vect_stmt_to_vectorize) has to have the same type as the lhs of the original
> statement.  But the other statements in the pattern sequence can do
> arbitrary things.  Their range isn't predictable from the range of the original
> statement result.
> 
> IIRC, the addition above is converted to:
> 
>   a' = (uint16_t) pixel[i]
>   b' = (uint16_t) level
>   sum' = a' + b'
>   sum = (int) sum'
> 
> where sum is the direct replacement of "pixel[i] + level", with the same type
> and range.  The division then uses sum' instead of sum.
> 
> But the fact that sum' is part of the same pattern as sum doesn't guarantee
> that sum' has the same range as sum.  E.g. the pattern statements added by
> the division optimisation wouldn't have this property.

So my assumption is that no pattern would replace a statement with something
that has higher precision than the C statement. The pattern above is demoted
by the vectorizer based on range information already. My assumption was that
the precision can only ever be smaller, because otherwise the pattern has violated
the semantics of the C code, which would be dangerous if e.g. the expression escapes?

> 
> Is it possible to tell ranger to compute the range of expressions that haven't
> been added to the IL?  (Genuine question, haven't looked.
> It seems pretty powerful though.)

I don't know either, I guess for things it has explicit knowledge about it's OK, so
+w or *w would be fine, but with a random IFN_ it'll likely have to punt as VARYING.

I guess while theoretically possible, I don't see a case where the vectorizer would
introduce a higher precision, as this would reduce your VF.

The only place I can think of where this is unsafe is if the division is introduced as part
of another pattern, but in that case the pattern won't have a related statement, so
we'll punt.

Regards,
Tamar

> 
> Thanks,
> Richard
Richard Sandiford Feb. 10, 2023, 4:25 p.m. UTC | #10
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 3:57 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> 69
>> >> > de2afea139d6 100644
>> >> > --- a/gcc/tree-vect-patterns.cc
>> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> *vinfo,
>> >> >        return pattern_stmt;
>> >> >      }
>> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> vectype,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL, NULL_RTX,
>> >> > -							  NULL_RTX))
>> >> > +	   && TYPE_UNSIGNED (itype)
>> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> > +	   && vectype
>> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >      {
>> >> > -      return NULL;
>> >> > +      /* div optimizations using narrowings
>> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
>> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> > +       double the precision of x.
>> >> > +
>> >> > +       If we imagine a short as being composed of two blocks of bytes
>> then
>> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent
>> to
>> >> > +       adding 1 to each sub component:
>> >> > +
>> >> > +	    short value of 16-bits
>> >> > +       ┌──────────────┬────────────────┐
>> >> > +       │              │                │
>> >> > +       └──────────────┴────────────────┘
>> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> > +		     │                │
>> >> > +		     │                │
>> >> > +		    +1               +1
>> >> > +
>> >> > +       after the first addition, we have to shift right by 8, and narrow the
>> >> > +       results back to a byte.  Remember that the addition must be done
>> in
>> >> > +       double the precision of the input.  However if we know that
>> >> > + the
>> >> addition
>> >> > +       `x + 257` does not overflow then we can do the operation in
>> >> > + the
>> >> current
>> >> > +       precision.  In which case we don't need the pack and unpacks.  */
>> >> > +      auto wcst = wi::to_wide (cst);
>> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> > +	{
>> >> > +	  wide_int min,max;
>> >> > +	  /* If we're in a pattern we need to find the original definition.  */
>> >> > +	  tree op0 = oprnd0;
>> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> > +	    {
>> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> > +	    }
>> >>
>> >> If this is generally safe (I'm skipping thinking about it in the
>> >> interests of a quick review :-)), then I think it should be done in
>> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> general than handling just assignments.
>> >>
>> >> > +
>> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> > +	     information we can't perform the optimization.  */
>> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> > +	    {
>> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> > +	      wi::overflow_type ovf;
>> >> > +	      /* We need adder and max in the same precision.  */
>> >> > +	      wide_int zadder
>> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> >> > +					  UNSIGNED);
>> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >>
>> >> Could you explain this a bit more?  When do we have mismatched
>> >> precisions?
>> >
>> > C promotion rules will promote e.g.
>> >
>> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >   for (int i = 0; i < n; i+=1)
>> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >
>> > And have the addition be done as a 32 bit integer.  The vectorizer
>> > will demote this down to a short, but range information is not stored
>> > for patterns.  So in the above the range will correctly be 0x1fe but
>> > the precision will be that of the original expression, so 32.  This
>> > will be a mismatch with itype which is derived from the size the vectorizer
>> will perform the operation in.
>> 
>> Gah, missed this first time round, sorry.
>> 
>> Richi would know better than me, but I think it's dangerous to rely on the
>> orig/pattern link for range information.  The end result of a pattern
>> (vect_stmt_to_vectorize) has to have the same type as the lhs of the original
>> statement.  But the other statements in the pattern sequence can do
>> arbitrary things.  Their range isn't predictable from the range of the original
>> statement result.
>> 
>> IIRC, the addition above is converted to:
>> 
>>   a' = (uint16_t) pixel[i]
>>   b' = (uint16_t) level
>>   sum' = a' + b'
>>   sum = (int) sum'
>> 
>> where sum is the direct replacement of "pixel[i] + level", with the same type
>> and range.  The division then uses sum' instead of sum.
>> 
>> But the fact that sum' is part of the same pattern as sum doesn't guarantee
>> that sum' has the same range as sum.  E.g. the pattern statements added by
>> the division optimisation wouldn't have this property.
>
> So my assumption is that no pattern would replace a statement with something
> that has higher precision than the C statement. The pattern above is demoted
> by the vectorizer based on range information already. My assumption was that
> the precision can only ever be smaller, because otherwise the pattern has violated
> the semantics of the C code, which would be dangerous if e.g. the expression escapes?

IMO the difference in precisions was a symptom of the problem rather
than the direct cause.

The point is more that "B = vect_orig_stmt(A)" just says "A is used
somehow in a new calculation of B".  A might equal B (if A replaces B),
or A might be an arbitrary temporary result.  The code above is instead
using it to mean "A equals B, expressed in a different type".  That
happens to be true for sum' in the sequence above, but it isn't true of
non-final pattern statements in general.

In other words, the code hasn't proved that the path from A to
vect_stmt_to_vectorize(B) just involves conversions.

Applying the range of a pattern result to all temporary results in
the pattern could lead to wrong results even when the precisions
are all the same.

>> Is it possible to tell ranger to compute the range of expressions that haven't
>> been added to the IL?  (Genuine question, haven't looked.
>> It seems pretty powerful though.)
>
> I don't know either, I guess for things it has explicit knowledge about it's ok, so
> +w or *w would be fine, but with a random IFN_ it'll likely have to punt as varying.

Yeah.  But sum' above involves simple arithmetic and conversions,
so IFNs shouldn't be a problem.

Thanks,
Richard
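
To make the point about temporary results concrete, here is a scalar rendering
of the division expansion (names and the [0, 0x1fe] input range are chosen to
match the running example; none of this is compiler output).  Every statement
produces a uint16_t, yet only q has the range of the original statement's
result:

  #include <stdint.h>

  /* Expansion of tmp / 0xff via the new pattern, for tmp in [0, 0x1fe].  */
  uint16_t expand_div (uint16_t tmp)
  {
    uint16_t t1 = tmp + 257;        /* temporary, range [0x101, 0x2ff]   */
    uint16_t t2 = tmp + (t1 >> 8);  /* temporary, range [0x1, 0x200]     */
    uint16_t q  = t2 >> 8;          /* final, range [0, 2] == tmp / 0xff */
    return q;
  }

Applying q's range to t1 or t2 would clearly be wrong, even though all three
results have the same precision.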
Tamar Christina Feb. 10, 2023, 4:33 p.m. UTC | #11
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 4:25 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Friday, February 10, 2023 3:57 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> >> 69
> >> >> > de2afea139d6 100644
> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> >> *vinfo,
> >> >> >        return pattern_stmt;
> >> >> >      }
> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> vectype,
> >> >> > -							  wi::to_wide
> (cst),
> >> >> > -							  NULL,
> NULL_RTX,
> >> >> > -							  NULL_RTX))
> >> >> > +	   && TYPE_UNSIGNED (itype)
> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> >> > +	   && vectype
> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >> >      {
> >> >> > -      return NULL;
> >> >> > +      /* div optimizations using narrowings
> >> >> > +       we can do the division e.g. shorts by 255 faster by calculating it
> as
> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> > +       double the precision of x.
> >> >> > +
> >> >> > +       If we imagine a short as being composed of two blocks of
> >> >> > + bytes
> >> then
> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> >> > + equivalent
> >> to
> >> >> > +       adding 1 to each sub component:
> >> >> > +
> >> >> > +	    short value of 16-bits
> >> >> > +       ┌──────────────┬────────────────┐
> >> >> > +       │              │                │
> >> >> > +       └──────────────┴────────────────┘
> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> >> > +		     │                │
> >> >> > +		     │                │
> >> >> > +		    +1               +1
> >> >> > +
> >> >> > +       after the first addition, we have to shift right by 8, and narrow
> the
> >> >> > +       results back to a byte.  Remember that the addition must
> >> >> > + be done
> >> in
> >> >> > +       double the precision of the input.  However if we know
> >> >> > + that the
> >> >> addition
> >> >> > +       `x + 257` does not overflow then we can do the operation
> >> >> > + in the
> >> >> current
> >> >> > +       precision.  In which case we don't need the pack and unpacks.
> */
> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> > +	{
> >> >> > +	  wide_int min,max;
> >> >> > +	  /* If we're in a pattern we need to find the original definition.  */
> >> >> > +	  tree op0 = oprnd0;
> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> >> > +	    {
> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
> (stmt_info);
> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
> (orig_stmt));
> >> >> > +	    }
> >> >>
> >> >> If this is generally safe (I'm skipping thinking about it in the
> >> >> interests of a quick review :-)), then I think it should be done
> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
> >> >> more general than handling just assignments.
> >> >>
> >> >> > +
> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> >> > +	     information we can't perform the optimization.  */
> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> >> > +	    {
> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> > +	      wi::overflow_type ovf;
> >> >> > +	      /* We need adder and max in the same precision.  */
> >> >> > +	      wide_int zadder
> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
> (max),
> >> >> > +					  UNSIGNED);
> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >> >>
> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> precisions?
> >> >
> >> > C promotion rules will promote e.g.
> >> >
> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >   for (int i = 0; i < n; i+=1)
> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >
> >> > And have the addition be done as a 32 bit integer.  The vectorizer
> >> > will demote this down to a short, but range information is not
> >> > stored for patterns.  So In the above the range will correctly be
> >> > 0x1fe but the precision will be that of the original expression, so
> >> > 32.  This will be a mismatch with itype which is derived from the
> >> > size the vectorizer
> >> will perform the operation in.
> >>
> >> Gah, missed this first time round, sorry.
> >>
> >> Richi would know better than me, but I think it's dangerous to rely
> >> on the orig/pattern link for range information.  The end result of a
> >> pattern
> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the
> >> original statement.  But the other statements in the pattern sequence
> >> can do arbitrary things.  Their range isn't predictable from the
> >> range of the original statement result.
> >>
> >> IIRC, the addition above is converted to:
> >>
> >>   a' = (uint16_t) pixel[i]
> >>   b' = (uint16_t) level
> >>   sum' = a' + b'
> >>   sum = (int) sum'
> >>
> >> where sum is the direct replacement of "pixel[i] + level", with the
> >> same type and range.  The division then uses sum' instead of sum.
> >>
> >> But the fact that sum' is part of the same pattern as sum doesn't
> >> guarantee that sum' has the same range as sum.  E.g. the pattern
> >> statements added by the division optimisation wouldn't have this
> property.
> >
> > So my assumption is that no pattern would replace a statement with
> > something that has higher precision than the C statement. The pattern
> > above is demoted by the vectorizer based on range information already.
> > My assumption was that the precision can only ever be smaller, because
> > otherwise the pattern has violated the semantics of the C code, which
> > would be dangerous if e.g. the expression escapes?
> 
> IMO the difference in precisions was a symptom of the problem rather than
> the direct cause.
> 
> The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow
> in a new calculation of B".  A might equal B (if A replaces B), or A might be an
> arbitrary temporary result.  The code above is instead using it to mean "A
> equals B, expressed in a different type".  That happens to be true for sum' in
> the sequence above, but it isn't true of non-final pattern statements in
> general.
> 

Sorry for being dense, but I thought that's exactly what the code does and what I
tried to explain before. If B isn't a final statement then it won't have an original statement.
AFAIK, the only place we set the original statement is at the root of the pattern expression.

> In other words, the code hasn't proved that the path from A to
> vect_stmt_to_vectorize(B) just involves conversions.
> 
> Applying the range of a pattern result to all temporary results in the pattern
> could lead to wrong results even when the precisions are all the same.

But maybe I'm misremembering here.  I don't believe we'd ever match in the
middle of a multi-statement pattern sequence, because the additional pattern
statements are not emitted in the instruction stream.  That's why we have
append_pattern_def_seq, which appends the additional statements to the
pattern's def sequence.

Unlike the original seed for the pattern, these aren't materialized until codegen or SLP build.

But I could be wrong...

Tamar

> 
> >> Is it possible to tell ranger to compute the range of expressions
> >> that haven't been added to the IL?  (Genuine question, haven't looked.
> >> It seems pretty powerful though.)
> >
> > I don't know either, I guess for things it has explicit knowledge
> > about it's ok, so
> > +w or *w would be fine, but with a random IFN_ it'll likely have to punt as
> varying.
> 
> Yeah.  But sum' above involves simple arithmetic and conversions, so IFNs
> shouldn't be a problem.
> 
> Thanks,
> Richard
Richard Sandiford Feb. 10, 2023, 4:57 p.m. UTC | #12
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 4:25 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Friday, February 10, 2023 3:57 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >> >
>> >> >>
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> >> 69
>> >> >> > de2afea139d6 100644
>> >> >> > --- a/gcc/tree-vect-patterns.cc
>> >> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> >> *vinfo,
>> >> >> >        return pattern_stmt;
>> >> >> >      }
>> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> >> vectype,
>> >> >> > -							  wi::to_wide
>> (cst),
>> >> >> > -							  NULL,
>> NULL_RTX,
>> >> >> > -							  NULL_RTX))
>> >> >> > +	   && TYPE_UNSIGNED (itype)
>> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> >> > +	   && vectype
>> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >> >      {
>> >> >> > -      return NULL;
>> >> >> > +      /* div optimizations using narrowings
>> >> >> > +       we can do the division e.g. shorts by 255 faster by calculating it
>> as
>> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> >> > +       double the precision of x.
>> >> >> > +
>> >> >> > +       If we imagine a short as being composed of two blocks of
>> >> >> > + bytes
>> >> then
>> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
>> >> >> > + equivalent
>> >> to
>> >> >> > +       adding 1 to each sub component:
>> >> >> > +
>> >> >> > +	    short value of 16-bits
>> >> >> > +       ┌──────────────┬────────────────┐
>> >> >> > +       │              │                │
>> >> >> > +       └──────────────┴────────────────┘
>> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> >> > +		     │                │
>> >> >> > +		     │                │
>> >> >> > +		    +1               +1
>> >> >> > +
>> >> >> > +       after the first addition, we have to shift right by 8, and narrow
>> the
>> >> >> > +       results back to a byte.  Remember that the addition must
>> >> >> > + be done
>> >> in
>> >> >> > +       double the precision of the input.  However if we know
>> >> >> > + that the
>> >> >> addition
>> >> >> > +       `x + 257` does not overflow then we can do the operation
>> >> >> > + in the
>> >> >> current
>> >> >> > +       precision.  In which case we don't need the pack and unpacks.
>> */
>> >> >> > +      auto wcst = wi::to_wide (cst);
>> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> >> > +	{
>> >> >> > +	  wide_int min,max;
>> >> >> > +	  /* If we're in a pattern we need to find the original definition.  */
>> >> >> > +	  tree op0 = oprnd0;
>> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> >> > +	    {
>> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
>> (stmt_info);
>> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
>> (orig_stmt));
>> >> >> > +	    }
>> >> >>
>> >> >> If this is generally safe (I'm skipping thinking about it in the
>> >> >> interests of a quick review :-)), then I think it should be done
>> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
>> >> >> more general than handling just assignments.
>> >> >>
>> >> >> > +
>> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> >> > +	     information we can't perform the optimization.  */
>> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> >> > +	    {
>> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> >> > +	      wi::overflow_type ovf;
>> >> >> > +	      /* We need adder and max in the same precision.  */
>> >> >> > +	      wide_int zadder
>> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
>> (max),
>> >> >> > +					  UNSIGNED);
>> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >> >>
>> >> >> Could you explain this a bit more?  When do we have mismatched
>> >> >> precisions?
>> >> >
>> >> > C promotion rules will promote e.g.
>> >> >
>> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >> >   for (int i = 0; i < n; i+=1)
>> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >> >
>> >> > And have the addition be done as a 32 bit integer.  The vectorizer
>> >> > will demote this down to a short, but range information is not
>> >> > stored for patterns.  So in the above the range will correctly be
>> >> > 0x1fe but the precision will be that of the original expression, so
>> >> > 32.  This will be a mismatch with itype which is derived from the
>> >> > size the vectorizer
>> >> will perform the operation in.
>> >>
>> >> Gah, missed this first time round, sorry.
>> >>
>> >> Richi would know better than me, but I think it's dangerous to rely
>> >> on the orig/pattern link for range information.  The end result of a
>> >> pattern
>> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of the
>> >> original statement.  But the other statements in the pattern sequence
>> >> can do arbitrary things.  Their range isn't predictable from the
>> >> range of the original statement result.
>> >>
>> >> IIRC, the addition above is converted to:
>> >>
>> >>   a' = (uint16_t) pixel[i]
>> >>   b' = (uint16_t) level
>> >>   sum' = a' + b'
>> >>   sum = (int) sum'
>> >>
>> >> where sum is the direct replacement of "pixel[i] + level", with the
>> >> same type and range.  The division then uses sum' instead of sum.
>> >>
>> >> But the fact that sum' is part of the same pattern as sum doesn't
>> >> guarantee that sum' has the same range as sum.  E.g. the pattern
>> >> statements added by the division optimisation wouldn't have this
>> property.
>> >
>> > So my assumption is that no pattern would replace a statement with
>> > something that has higher precision than the C statement. The pattern
>> > above is demoted by the vectorizer based on range information already.
>> > My assumption was that the precision can only ever be smaller, because
>> > otherwise the pattern has violated the semantics of the C code, which
>> > would be dangerous if e.g. the expression escapes?
>> 
>> IMO the difference in precisions was a symptom of the problem rather than
>> the direct cause.
>> 
>> The point is more that "B = vect_orig_stmt(A)" just says "A is used somehow
>> in a new calculation of B".  A might equal B (if A replaces B), or A might be an
>> arbitrary temporary result.  The code above is instead using it to mean "A
>> equals B, expressed in a different type".  That happens to be true for sum' in
>> the sequence above, but it isn't true of non-final pattern statements in
>> general.
>> 
>
> Sorry for being dense, but I thought that's exactly what the code does and what I
> tried to explain before. If B isn't a final statement then it won't have an original statement.
> AFAIK, the only place we set the original statement is at the root of the pattern expression.

Final pattern statements (those not in DEF_SEQ) always have the same
type and value as the original statements.  We wouldn't see mismatched
precisions if we were only looking at final pattern statements.

Like you say, the 16-bit addition didn't exist before vectorisation
(it was a 32-bit addition instead).  So to make things type-correct,
the 32-bit addition:

   A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)

is replaced with:

   DEF_SEQ:
     A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
   A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)

(using different notation from before, just to confuse things).
Here, A2 is the final pattern statement for A and A1 is just a
temporary result.  sum == sum'.

Later, we do a similar thing for the division itself.  We have:

   B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)

We realise that this can be a 16-bit division, so (IIRC) we use
vect_look_through_possible_promotion on sum to find the best
starting point.  This should give:

   DEF_SEQ:
     B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
   B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)

Both changes are done by vect_widened_op_tree.

We then apply the division pattern to B1.  B1 is a nonfinal pattern
statement that uses the result (tmp) of another nonfinal pattern
statement (A1).

The code does:

	  if (is_pattern_stmt_p (stmt_info))
	    {
	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
	    }

is_pattern_stmt_p is true for both A1 and A2, and STMT_VINFO_RELATED_STMT
is A for both A1 and A2.  I would expect:

  gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));

(testing for a final pattern) to fail for the motivating example.

Thanks,
Richard
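
A scalar rendering of the statement graph above may help (purely illustrative;
the labels follow the notation in the reply, and the C names are invented):

  #include <stdint.h>

  int pattern_view (uint8_t pixel_i, uint8_t level)
  {
    /* A's pattern: A1 lives in the DEF_SEQ; A2 is A's final statement.  */
    uint16_t tmp = (uint16_t) pixel_i + (uint16_t) level; /* A1 (temporary)   */
    int sum = (int) tmp;                                  /* A2 (== sum)      */

    /* B's pattern: B1 reuses the temporary tmp rather than the final sum.  */
    uint16_t tmp2 = tmp / (uint16_t) 0xff;                /* B1 (temporary)   */
    int quotient = (int) tmp2;                            /* B2 (== quotient) */

    (void) sum;
    return quotient;
  }

Here tmp happens to have the same value as sum, so taking the range of A's lhs
works out, but nothing in the orig/pattern link itself guarantees that a
non-final result like tmp equals the original lhs in general.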
Richard Sandiford Feb. 10, 2023, 5:01 p.m. UTC | #13
Richard Sandiford <richard.sandiford@arm.com> writes:
> Final pattern statements (those not in DEF_SEQ) always have the same
> type and value as the original statements.  We wouldn't see mismatched
> precisions if we were only looking at final pattern statements.
>
> Like you say, the 16-bit addition didn't exist before vectorisation
> (it was a 32-bit addition instead).  So to make things type-correct,
> the 32-bit addition:
>
>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
>
> is replaced with:
>
>    DEF_SEQ:
>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
>
> (using different notation from before, just to confuse things).
> Here, A2 is the final pattern statement for A and A1 is just a
> temporary result.  sum == sum'.
>
> Later, we do a similar thing for the division itself.  We have:
>
>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
>
> We realise that this can be a 16-bit division, so (IIRC) we use
> vect_look_through_possible_promotion on sum to find the best
> starting point.  This should give:
>
>    DEF_SEQ:
>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
>
> Both changes are done by vect_widened_op_tree.

Eh, I meant vect_recog_over_widening_pattern.

Richard
Tamar Christina Feb. 10, 2023, 5:14 p.m. UTC | #14
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, February 10, 2023 4:57 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Friday, February 10, 2023 4:25 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Friday, February 10, 2023 3:57 PM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >> >
> >> >> >>
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> >> >> 69
> >> >> >> > de2afea139d6 100644
> >> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
> (vec_info
> >> >> *vinfo,
> >> >> >> >        return pattern_stmt;
> >> >> >> >      }
> >> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> >> vectype,
> >> >> >> > -							  wi::to_wide
> >> (cst),
> >> >> >> > -							  NULL,
> >> NULL_RTX,
> >> >> >> > -							  NULL_RTX))
> >> >> >> > +	   && TYPE_UNSIGNED (itype)
> >> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> >> >> > +	   && vectype
> >> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >> >> >      {
> >> >> >> > -      return NULL;
> >> >> >> > +      /* div optimizations using narrowings
> >> >> >> > +       we can do the division e.g. shorts by 255 faster by
> >> >> >> > + calculating it
> >> as
> >> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> >> > +       double the precision of x.
> >> >> >> > +
> >> >> >> > +       If we imagine a short as being composed of two blocks
> >> >> >> > + of bytes
> >> >> then
> >> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> >> >> > + equivalent
> >> >> to
> >> >> >> > +       adding 1 to each sub component:
> >> >> >> > +
> >> >> >> > +	    short value of 16-bits
> >> >> >> > +       ┌──────────────┬────────────────┐
> >> >> >> > +       │              │                │
> >> >> >> > +       └──────────────┴────────────────┘
> >> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> >> >> > +		     │                │
> >> >> >> > +		     │                │
> >> >> >> > +		    +1               +1
> >> >> >> > +
> >> >> >> > +       after the first addition, we have to shift right by
> >> >> >> > + 8, and narrow
> >> the
> >> >> >> > +       results back to a byte.  Remember that the addition
> >> >> >> > + must be done
> >> >> in
> >> >> >> > +       double the precision of the input.  However if we
> >> >> >> > + know that the
> >> >> >> addition
> >> >> >> > +       `x + 257` does not overflow then we can do the
> >> >> >> > + operation in the
> >> >> >> current
> >> >> >> > +       precision.  In which case we don't need the pack and
> unpacks.
> >> */
> >> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> >> > +	{
> >> >> >> > +	  wide_int min,max;
> >> >> >> > +	  /* If we're in a pattern we need to find the original definition.  */
> >> >> >> > +	  tree op0 = oprnd0;
> >> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> >> >> > +	    {
> >> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
> >> (stmt_info);
> >> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
> >> (orig_stmt));
> >> >> >> > +	    }
> >> >> >>
> >> >> >> If this is generally safe (I'm skipping thinking about it in
> >> >> >> the interests of a quick review :-)), then I think it should be
> >> >> >> done in vect_get_range_info instead.  Using gimple_get_lhs
> >> >> >> would be more general than handling just assignments.
> >> >> >>
> >> >> >> > +
> >> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> >> >> > +	     information we can't perform the optimization.  */
> >> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> >> >> > +	    {
> >> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> >> > +	      wi::overflow_type ovf;
> >> >> >> > +	      /* We need adder and max in the same precision.  */
> >> >> >> > +	      wide_int zadder
> >> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
> >> (max),
> >> >> >> > +					  UNSIGNED);
> >> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >> >> >>
> >> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> >> precisions?
> >> >> >
> >> >> > C promotion rules will promote e.g.
> >> >> >
> >> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >> >   for (int i = 0; i < n; i+=1)
> >> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> >
> >> >> > And have the addition be done as a 32 bit integer.  The
> >> >> > vectorizer will demote this down to a short, but range
> >> >> > information is not stored for patterns.  So In the above the
> >> >> > range will correctly be 0x1fe but the precision will be that of
> >> >> > the original expression, so 32.  This will be a mismatch with
> >> >> > itype which is derived from the size the vectorizer
> >> >> will perform the operation in.
> >> >>
> >> >> Gah, missed this first time round, sorry.
> >> >>
> >> >> Richi would know better than me, but I think it's dangerous to
> >> >> rely on the orig/pattern link for range information.  The end
> >> >> result of a pattern
> >> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of
> >> >> the original statement.  But the other statements in the pattern
> >> >> sequence can do arbitrary things.  Their range isn't predictable
> >> >> from the range of the original statement result.
> >> >>
> >> >> IIRC, the addition above is converted to:
> >> >>
> >> >>   a' = (uint16_t) pixel[i]
> >> >>   b' = (uint16_t) level
> >> >>   sum' = a' + b'
> >> >>   sum = (int) sum'
> >> >>
> >> >> where sum is the direct replacement of "pixel[i] + level", with
> >> >> the same type and range.  The division then uses sum' instead of sum.
> >> >>
> >> >> But the fact that sum' is part of the same pattern as sum doesn't
> >> >> guarantee that sum' has the same range as sum.  E.g. the pattern
> >> >> statements added by the division optimisation wouldn't have this
> >> property.
> >> >
> >> > So my assumption is that no pattern would replace a statement with
> >> > something That has higher precision than the C statement. The
> >> > pattern above is demoted By the vectorizer based on range information
> already.
> >> > My assumption was that the precision can only ever be smaller,
> >> > because otherwise the pattern has violated the semantics of the C
> >> > code, which
> >> would be dangerous if e.g. the expression escapes?
> >>
> >> IMO the difference in precisions was a symptom of the problem rather
> >> than the direct cause.
> >>
> >> The point is more that "B = vect_orig_stmt(A)" just says "A is used
> >> somehow in a new calculation of B".  A might equal B (if A replaces
> >> B), or A might be an arbitrary temporary result.  The code above is
> >> instead using it to mean "A equals B, expressed in a different type".
> >> That happens to be true for sum' in the sequence above, but it isn't
> >> true of non-final pattern statements in general.
> >>
> >
> > Sorry for being dense, but I though that's exactly what the code does
> > and what I tried explain before. If B isn't a final statement than it won't
> have an original statement.
> > AFAIK, the only places we set original statement is the root of the pattern
> expression.
> 
> Final pattern statements (those not in DEF_SEQ) always have the same type
> and value as the original statements.  We wouldn't see mismatched
> precisions if we were only looking at final pattern statements.

We would, because the entire problem is that pattern statements have no ranges.
Ranger does not track them after they have been created.  This could of course
trivially be solved if we told ranger about the demotion we did, but we don't do so
at the moment.  It will just return VARYING here.  This is the root cause of the issue.
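
(For illustration only -- a rough sketch of what "telling ranger about the
demotion" could look like; new_type and new_ssa_name are hypothetical names
and this is not something the patch currently does:)

  /* When the over-widening pattern demotes the addition, record the known
     range on the new narrower SSA name, e.g. [0, 0x1fe] for the pixel
     example.  */
  int_range<2> r (build_int_cst (new_type, 0),
		  build_int_cst (new_type, 0x1fe));
  set_range_info (new_ssa_name, r);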

> 
> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 32-
> bit addition instead).  So to make things type-correct, the 32-bit addition:
> 
>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
> 
> is replaced with:
> 
>    DEF_SEQ:
>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
> 
> (using different notation from before, just to confuse things).
> Here, A2 is the final pattern statement for A and A1 is just a temporary result.
> sum == sum'.
> 
> Later, we do a similar thing for the division itself.  We have:
> 
>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
> 
> We realise that this can be a 16-bit division, so (IIRC) we use
> vect_look_through_possible_promotion on sum to find the best starting
> point.  This should give:
> 
>    DEF_SEQ:
>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
> 
> Both changes are done by vect_widened_op_tree.
> 
> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
> that uses the result (tmp) of another nonfinal pattern statement (A1).
> 
> The code does:
> 
> 	  if (is_pattern_stmt_p (stmt_info))
> 	    {
> 	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> 	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> 		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> 	    }
> 
> is_pattern_stmt_p is true for both A1 and A2, and
> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
> 
>   gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
> 
> (testing for a final pattern) to fail for the motivating example.
> 

I think we're actually saying the same thing.  All I'm saying is that looking
at the original statement is a safe alternative, as it will conservatively overestimate
to VARYING or give a wider range than the pattern would have.

I'm saying it's conservatively safe, while not overly accurate.  The alternative would be
to tell ranger about the demotions in vect_recog_over_widening_pattern using range::set.

But for this to work the general widening patterns also have to update the range information.

I think where we're disagreeing is that I think looking at the original scalar statement is a safe
conservative estimate.  It will fail in some cases, but that's a missed optimization, not a
mis-optimization.

In any case, if you disagree I don't really see a way forward aside from making this its own
pattern and running it before the overwidening pattern.

Alternatively I'd love to know how to proceed.

Tamar

> Thanks,
> Richard
Richard Sandiford Feb. 10, 2023, 6:12 p.m. UTC | #15
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, February 10, 2023 4:57 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Friday, February 10, 2023 4:25 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> -----Original Message-----
>> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> >> Sent: Friday, February 10, 2023 3:57 PM
>> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> >> div-bitmask by using new optabs [PR108583]
>> >> >>
>> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>> >> >> >> 69
>> >> >> >> > de2afea139d6 100644
>> >> >> >> > --- a/gcc/tree-vect-patterns.cc
>> >> >> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
>> (vec_info
>> >> >> *vinfo,
>> >> >> >> >        return pattern_stmt;
>> >> >> >> >      }
>> >> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> >> >> vectype,
>> >> >> >> > -							  wi::to_wide
>> >> (cst),
>> >> >> >> > -							  NULL,
>> >> NULL_RTX,
>> >> >> >> > -							  NULL_RTX))
>> >> >> >> > +	   && TYPE_UNSIGNED (itype)
>> >> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> >> >> > +	   && vectype
>> >> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> >> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >> >> >      {
>> >> >> >> > -      return NULL;
>> >> >> >> > +      /* div optimizations using narrowings
>> >> >> >> > +       we can do the division e.g. shorts by 255 faster by
>> >> >> >> > + calculating it
>> >> as
>> >> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> >> >> > +       double the precision of x.
>> >> >> >> > +
>> >> >> >> > +       If we imagine a short as being composed of two blocks
>> >> >> >> > + of bytes
>> >> >> then
>> >> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
>> >> >> >> > + equivalent
>> >> >> to
>> >> >> >> > +       adding 1 to each sub component:
>> >> >> >> > +
>> >> >> >> > +	    short value of 16-bits
>> >> >> >> > +       ┌──────────────┬────────────────┐
>> >> >> >> > +       │              │                │
>> >> >> >> > +       └──────────────┴────────────────┘
>> >> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> >> >> > +		     │                │
>> >> >> >> > +		     │                │
>> >> >> >> > +		    +1               +1
>> >> >> >> > +
>> >> >> >> > +       after the first addition, we have to shift right by
>> >> >> >> > + 8, and narrow
>> >> the
>> >> >> >> > +       results back to a byte.  Remember that the addition
>> >> >> >> > + must be done
>> >> >> in
>> >> >> >> > +       double the precision of the input.  However if we
>> >> >> >> > + know that the
>> >> >> >> addition
>> >> >> >> > +       `x + 257` does not overflow then we can do the
>> >> >> >> > + operation in the
>> >> >> >> current
>> >> >> >> > +       precision.  In which case we don't need the pack and
>> unpacks.
>> >> */
>> >> >> >> > +      auto wcst = wi::to_wide (cst);
>> >> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> >> >> > +	{
>> >> >> >> > +	  wide_int min,max;
>> >> >> >> > +	  /* If we're in a pattern we need to find the orginal
>> >> definition.  */
>> >> >> >> > +	  tree op0 = oprnd0;
>> >> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> >> >> > +	    {
>> >> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
>> >> (stmt_info);
>> >> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
>> >> (orig_stmt));
>> >> >> >> > +	    }
>> >> >> >>
>> >> >> >> If this is generally safe (I'm skipping thinking about it in
>> >> >> >> the interests of a quick review :-)), then I think it should be
>> >> >> >> done in vect_get_range_info instead.  Using gimple_get_lhs
>> >> >> >> would be more general than handling just assignments.
>> >> >> >>
>> >> >> >> > +
>> >> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> >> >> > +	     information we can't perform the optimization.  */
>> >> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> >> >> > +	    {
>> >> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> >> >> > +	      wi::overflow_type ovf;
>> >> >> >> > +	      /* We need adder and max in the same precision.  */
>> >> >> >> > +	      wide_int zadder
>> >> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
>> >> (max),
>> >> >> >> > +					  UNSIGNED);
>> >> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >> >> >>
>> >> >> >> Could you explain this a bit more?  When do we have mismatched
>> >> >> >> precisions?
>> >> >> >
>> >> >> > C promotion rules will promote e.g.
>> >> >> >
>> >> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >> >> >   for (int i = 0; i < n; i+=1)
>> >> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >> >> >
>> >> >> > And have the addition be done as a 32 bit integer.  The
>> >> >> > vectorizer will demote this down to a short, but range
>> >> >> > information is not stored for patterns.  So In the above the
>> >> >> > range will correctly be 0x1fe but the precision will be that of
>> >> >> > the original expression, so 32.  This will be a mismatch with
>> >> >> > itype which is derived from the size the vectorizer
>> >> >> will perform the operation in.
>> >> >>
>> >> >> Gah, missed this first time round, sorry.
>> >> >>
>> >> >> Richi would know better than me, but I think it's dangerous to
>> >> >> rely on the orig/pattern link for range information.  The end
>> >> >> result of a pattern
>> >> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of
>> >> >> the original statement.  But the other statements in the pattern
>> >> >> sequence can do arbitrary things.  Their range isn't predictable
>> >> >> from the range of the original statement result.
>> >> >>
>> >> >> IIRC, the addition above is converted to:
>> >> >>
>> >> >>   a' = (uint16_t) pixel[i]
>> >> >>   b' = (uint16_t) level
>> >> >>   sum' = a' + b'
>> >> >>   sum = (int) sum'
>> >> >>
>> >> >> where sum is the direct replacement of "pixel[i] + level", with
>> >> >> the same type and range.  The division then uses sum' instead of sum.
>> >> >>
>> >> >> But the fact that sum' is part of the same pattern as sum doesn't
>> >> >> guarantee that sum' has the same range as sum.  E.g. the pattern
>> >> >> statements added by the division optimisation wouldn't have this
>> >> property.
>> >> >
>> >> > So my assumption is that no pattern would replace a statement with
>> >> > something That has higher precision than the C statement. The
>> >> > pattern above is demoted By the vectorizer based on range information
>> already.
>> >> > My assumption was that the precision can only ever be smaller,
>> >> > because otherwise the pattern has violated the semantics of the C
>> >> > code, which
>> >> would be dangerous if e.g. the expression escapes?
>> >>
>> >> IMO the difference in precisions was a symptom of the problem rather
>> >> than the direct cause.
>> >>
>> >> The point is more that "B = vect_orig_stmt(A)" just says "A is used
>> >> somehow in a new calculation of B".  A might equal B (if A replaces
>> >> B), or A might be an arbitrary temporary result.  The code above is
>> >> instead using it to mean "A equals B, expressed in a different type".
>> >> That happens to be true for sum' in the sequence above, but it isn't
>> >> true of non-final pattern statements in general.
>> >>
>> >
>> > Sorry for being dense, but I though that's exactly what the code does
>> > and what I tried explain before. If B isn't a final statement than it won't
>> have an original statement.
>> > AFAIK, the only places we set original statement is the root of the pattern
>> expression.
>> 
>> Final pattern statements (those not in DEF_SEQ) always have the same type
>> and value as the original statements.  We wouldn't see mismatched
>> precisions if we were only looking at final pattern statements.
>
> We would because the entire problem is that pattern statement have no ranges.
> Ranger does not track them after they have been created.  This could of course
> Trivially be solved if we tell ranger about the demotion we did, but we don't do so
> at the moment. It will just return varying here.  This is the root cause of the issue.
>
>> 
>> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 32-
>> bit addition instead).  So to make things type-correct, the 32-bit addition:
>> 
>>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
>> 
>> is replaced with:
>> 
>>    DEF_SEQ:
>>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
>> 
>> (using different notation from before, just to confuse things).
>> Here, A2 is the final pattern statement for A and A1 is just a temporary result.
>> sum == sum'.
>> 
>> Later, we do a similar thing for the division itself.  We have:
>> 
>>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
>> 
>> We realise that this can be a 16-bit division, so (IIRC) we use
>> vect_look_through_possible_promotion on sum to find the best starting
>> point.  This should give:
>> 
>>    DEF_SEQ:
>>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
>> 
>> Both changes are done by vect_widened_op_tree.
>> 
>> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
>> that uses the result (tmp) of another nonfinal pattern statement (A1).
>> 
>> The code does:
>> 
>> 	  if (is_pattern_stmt_p (stmt_info))
>> 	    {
>> 	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> 	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> 		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> 	    }
>> 
>> is_pattern_stmt_p is true for both A1 and A2, and
>> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
>> 
>>   gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
>> 
>> (testing for a final pattern) to fail for the motivating example.
>> 
>
> I think we're actually saying the same thing. I believe all I'm saying is that looking
> at the original statement is a safe alternative as it conservatively will overestimate
> to VARYING or give a wider range than the pattern would have.

Hmm, but you said "If B isn't a final statement than it won't have an
original statement. AFAIK, the only places we set original statement
is the root of the pattern expression."  My point was that that isn't true.
All statements in the pattern have an original statement, not just the root.
And we're specifically relying on that for the motivating example to work.

> I'm saying it's conservatively safe, while not overly accurate.  The alternative would be
> to tell ranger about the demotions in vect_recog_over_widening_pattern using range::set.
>
> But for this to work the general widening pattern also have to update the range information.
>
> I think where we're disagreeing is that I think looking at the original scalar statement is a safe
> conservative estimate.  It will fail in some cases, but that's a missed optimization, not a miss-optimization.

Yeah, like you say, I disagree that it's conservatively correct.
It means that we're hoping (without proving) that the only things
between stmt_info and the final pattern statement are conversions.
I don't think there's any reason in principle why that must hold.

What would be conservatively correct would be to start from the
final pattern statement and work our way down to the value that
is actually being used.  That seems a bit convoluted though,
so I'd prefer not to do that...

> In any case, if you disagree I don’t' really see a way forward aside from making this its own pattern
> running it before the overwidening pattern.

I think we should look to see if ranger can be persuaded to provide the
range of the 16-bit addition, even though the statement that produces it
isn't part of a BB.  It shouldn't matter that the addition originally
came from a 32-bit one: the range follows directly from the ranges of
the operands (i.e. the fact that the operands are the results of
widening conversions).
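
(As a quick standalone check of that claim -- illustrative C only, not part of
the patch: the sum of two zero-extended 8-bit values always fits in 16 bits,
with upper bound 0x1fe, regardless of the width the scalar code originally used:)

  #include <stdint.h>
  #include <assert.h>

  int main (void)
  {
    for (uint32_t a = 0; a <= 0xff; a++)
      for (uint32_t b = 0; b <= 0xff; b++)
	assert ((uint16_t) a + (uint16_t) b <= 0x1fe);
    return 0;
  }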

Thanks,
Richard
Richard Biener Feb. 10, 2023, 6:34 p.m. UTC | #16
> Am 10.02.2023 um 19:12 schrieb Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org>:
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>> Sent: Friday, February 10, 2023 4:57 PM
>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>>> by using new optabs [PR108583]
>>> 
>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>> -----Original Message-----
>>>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>>>> Sent: Friday, February 10, 2023 4:25 PM
>>>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>>>>> div-bitmask by using new optabs [PR108583]
>>>>> 
>>>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>>>> -----Original Message-----
>>>>>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>>>>>> Sent: Friday, February 10, 2023 3:57 PM
>>>>>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>>>>>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>>>>>>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>>>>>>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>>>>>>> div-bitmask by using new optabs [PR108583]
>>>>>>> 
>>>>>>> Tamar Christina <Tamar.Christina@arm.com> writes:
>>>>>>>>>> a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
>>>>>>>>> 69
>>>>>>>>>> de2afea139d6 100644
>>>>>>>>>> --- a/gcc/tree-vect-patterns.cc
>>>>>>>>>> +++ b/gcc/tree-vect-patterns.cc
>>>>>>>>>> @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
>>> (vec_info
>>>>>>> *vinfo,
>>>>>>>>>>       return pattern_stmt;
>>>>>>>>>>     }
>>>>>>>>>>   else if ((cst = uniform_integer_cst_p (oprnd1))
>>>>>>>>>> -       && targetm.vectorize.can_special_div_by_const (rhs_code,
>>>>>>>>> vectype,
>>>>>>>>>> -                              wi::to_wide
>>>>> (cst),
>>>>>>>>>> -                              NULL,
>>>>> NULL_RTX,
>>>>>>>>>> -                              NULL_RTX))
>>>>>>>>>> +       && TYPE_UNSIGNED (itype)
>>>>>>>>>> +       && rhs_code == TRUNC_DIV_EXPR
>>>>>>>>>> +       && vectype
>>>>>>>>>> +       && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>>>>>>>>>> +                          OPTIMIZE_FOR_SPEED))
>>>>>>>>>>     {
>>>>>>>>>> -      return NULL;
>>>>>>>>>> +      /* div optimizations using narrowings
>>>>>>>>>> +       we can do the division e.g. shorts by 255 faster by
>>>>>>>>>> + calculating it
>>>>> as
>>>>>>>>>> +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>>>>>>>>>> +       double the precision of x.
>>>>>>>>>> +
>>>>>>>>>> +       If we imagine a short as being composed of two blocks
>>>>>>>>>> + of bytes
>>>>>>> then
>>>>>>>>>> +       adding 257 or 0b0000_0001_0000_0001 to the number is
>>>>>>>>>> + equivalent
>>>>>>> to
>>>>>>>>>> +       adding 1 to each sub component:
>>>>>>>>>> +
>>>>>>>>>> +        short value of 16-bits
>>>>>>>>>> +       ┌──────────────┬────────────────┐
>>>>>>>>>> +       │              │                │
>>>>>>>>>> +       └──────────────┴────────────────┘
>>>>>>>>>> +     8-bit part1 ▲  8-bit part2   ▲
>>>>>>>>>> +             │                │
>>>>>>>>>> +             │                │
>>>>>>>>>> +            +1               +1
>>>>>>>>>> +
>>>>>>>>>> +       after the first addition, we have to shift right by
>>>>>>>>>> + 8, and narrow
>>>>> the
>>>>>>>>>> +       results back to a byte.  Remember that the addition
>>>>>>>>>> + must be done
>>>>>>> in
>>>>>>>>>> +       double the precision of the input.  However if we
>>>>>>>>>> + know that the
>>>>>>>>> addition
>>>>>>>>>> +       `x + 257` does not overflow then we can do the
>>>>>>>>>> + operation in the
>>>>>>>>> current
>>>>>>>>>> +       precision.  In which case we don't need the pack and
>>> unpacks.
>>>>> */
>>>>>>>>>> +      auto wcst = wi::to_wide (cst);
>>>>>>>>>> +      int pow = wi::exact_log2 (wcst + 1);
>>>>>>>>>> +      if (pow == (int) (element_precision (vectype) / 2))
>>>>>>>>>> +    {
>>>>>>>>>> +      wide_int min,max;
>>>>>>>>>> +      /* If we're in a pattern we need to find the orginal
>>>>> definition.  */
>>>>>>>>>> +      tree op0 = oprnd0;
>>>>>>>>>> +      gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>>>>>>>>>> +      stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>>>>>>>>>> +      if (is_pattern_stmt_p (stmt_info))
>>>>>>>>>> +        {
>>>>>>>>>> +          auto orig_stmt = STMT_VINFO_RELATED_STMT
>>>>> (stmt_info);
>>>>>>>>>> +          if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>>>>>>>>>> +        op0 = gimple_assign_lhs (STMT_VINFO_STMT
>>>>> (orig_stmt));
>>>>>>>>>> +        }
>>>>>>>>> 
>>>>>>>>> If this is generally safe (I'm skipping thinking about it in
>>>>>>>>> the interests of a quick review :-)), then I think it should be
>>>>>>>>> done in vect_get_range_info instead.  Using gimple_get_lhs
>>>>>>>>> would be more general than handling just assignments.
>>>>>>>>> 
>>>>>>>>>> +
>>>>>>>>>> +      /* Check that no overflow will occur.  If we don't have range
>>>>>>>>>> +         information we can't perform the optimization.  */
>>>>>>>>>> +      if (vect_get_range_info (op0, &min, &max))
>>>>>>>>>> +        {
>>>>>>>>>> +          wide_int one = wi::to_wide (build_one_cst (itype));
>>>>>>>>>> +          wide_int adder = wi::add (one, wi::lshift (one, pow));
>>>>>>>>>> +          wi::overflow_type ovf;
>>>>>>>>>> +          /* We need adder and max in the same precision.  */
>>>>>>>>>> +          wide_int zadder
>>>>>>>>>> +        = wide_int_storage::from (adder, wi::get_precision
>>>>> (max),
>>>>>>>>>> +                      UNSIGNED);
>>>>>>>>>> +          wi::add (max, zadder, UNSIGNED, &ovf);
>>>>>>>>> 
>>>>>>>>> Could you explain this a bit more?  When do we have mismatched
>>>>>>>>> precisions?
>>>>>>>> 
>>>>>>>> C promotion rules will promote e.g.
>>>>>>>> 
>>>>>>>> void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>>>>>>>>  for (int i = 0; i < n; i+=1)
>>>>>>>>    pixel[i] = (pixel[i] + level) / 0xff; }
>>>>>>>> 
>>>>>>>> And have the addition be done as a 32 bit integer.  The
>>>>>>>> vectorizer will demote this down to a short, but range
>>>>>>>> information is not stored for patterns.  So In the above the
>>>>>>>> range will correctly be 0x1fe but the precision will be that of
>>>>>>>> the original expression, so 32.  This will be a mismatch with
>>>>>>>> itype which is derived from the size the vectorizer
>>>>>>> will perform the operation in.
>>>>>>> 
>>>>>>> Gah, missed this first time round, sorry.
>>>>>>> 
>>>>>>> Richi would know better than me, but I think it's dangerous to
>>>>>>> rely on the orig/pattern link for range information.  The end
>>>>>>> result of a pattern
>>>>>>> (vect_stmt_to_vectorize) has to have the same type as the lhs of
>>>>>>> the original statement.  But the other statements in the pattern
>>>>>>> sequence can do arbitrary things.  Their range isn't predictable
>>>>>>> from the range of the original statement result.
>>>>>>> 
>>>>>>> IIRC, the addition above is converted to:
>>>>>>> 
>>>>>>>  a' = (uint16_t) pixel[i]
>>>>>>>  b' = (uint16_t) level
>>>>>>>  sum' = a' + b'
>>>>>>>  sum = (int) sum'
>>>>>>> 
>>>>>>> where sum is the direct replacement of "pixel[i] + level", with
>>>>>>> the same type and range.  The division then uses sum' instead of sum.
>>>>>>> 
>>>>>>> But the fact that sum' is part of the same pattern as sum doesn't
>>>>>>> guarantee that sum' has the same range as sum.  E.g. the pattern
>>>>>>> statements added by the division optimisation wouldn't have this
>>>>> property.
>>>>>> 
>>>>>> So my assumption is that no pattern would replace a statement with
>>>>>> something That has higher precision than the C statement. The
>>>>>> pattern above is demoted By the vectorizer based on range information
>>> already.
>>>>>> My assumption was that the precision can only ever be smaller,
>>>>>> because otherwise the pattern has violated the semantics of the C
>>>>>> code, which
>>>>> would be dangerous if e.g. the expression escapes?
>>>>> 
>>>>> IMO the difference in precisions was a symptom of the problem rather
>>>>> than the direct cause.
>>>>> 
>>>>> The point is more that "B = vect_orig_stmt(A)" just says "A is used
>>>>> somehow in a new calculation of B".  A might equal B (if A replaces
>>>>> B), or A might be an arbitrary temporary result.  The code above is
>>>>> instead using it to mean "A equals B, expressed in a different type".
>>>>> That happens to be true for sum' in the sequence above, but it isn't
>>>>> true of non-final pattern statements in general.
>>>>> 
>>>> 
>>>> Sorry for being dense, but I though that's exactly what the code does
>>>> and what I tried explain before. If B isn't a final statement than it won't
>>> have an original statement.
>>>> AFAIK, the only places we set original statement is the root of the pattern
>>> expression.
>>> 
>>> Final pattern statements (those not in DEF_SEQ) always have the same type
>>> and value as the original statements.  We wouldn't see mismatched
>>> precisions if we were only looking at final pattern statements.
>> 
>> We would because the entire problem is that pattern statement have no ranges.
>> Ranger does not track them after they have been created.  This could of course
>> Trivially be solved if we tell ranger about the demotion we did, but we don't do so
>> at the moment. It will just return varying here.  This is the root cause of the issue.
>> 
>>> 
>>> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 32-
>>> bit addition instead).  So to make things type-correct, the 32-bit addition:
>>> 
>>>   A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
>>> 
>>> is replaced with:
>>> 
>>>   DEF_SEQ:
>>>     A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>>>   A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
>>> 
>>> (using different notation from before, just to confuse things).
>>> Here, A2 is the final pattern statement for A and A1 is just a temporary result.
>>> sum == sum'.
>>> 
>>> Later, we do a similar thing for the division itself.  We have:
>>> 
>>>   B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
>>> 
>>> We realise that this can be a 16-bit division, so (IIRC) we use
>>> vect_look_through_possible_promotion on sum to find the best starting
>>> point.  This should give:
>>> 
>>>   DEF_SEQ:
>>>     B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>>>   B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
>>> 
>>> Both changes are done by vect_widened_op_tree.
>>> 
>>> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
>>> that uses the result (tmp) of another nonfinal pattern statement (A1).
>>> 
>>> The code does:
>>> 
>>>      if (is_pattern_stmt_p (stmt_info))
>>>        {
>>>          auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>>>          if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>>>        op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>>>        }
>>> 
>>> is_pattern_stmt_p is true for both A1 and A2, and
>>> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
>>> 
>>>  gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
>>> 
>>> (testing for a final pattern) to fail for the motivating example.
>>> 
>> 
>> I think we're actually saying the same thing. I believe all I'm saying is that looking
>> at the original statement is a safe alternative as it conservatively will overestimate
>> to VARYING or give a wider range than the pattern would have.
> 
> Hmm, but you said "If B isn't a final statement than it won't have an
> original statement. AFAIK, the only places we set original statement
> is the root of the pattern expression."  My point was that that isn't true.
> All statements in the pattern have an original statement, not just the root.
> And we're specifically relying on that for the motivating example to work.
> 
>> I'm saying it's conservatively safe, while not overly accurate.  The alternative would be
>> to tell ranger about the demotions in vect_recog_over_widening_pattern using range::set.
>> 
>> But for this to work the general widening pattern also have to update the range information.
>> 
>> I think where we're disagreeing is that I think looking at the original scalar statement is a safe
>> conservative estimate.  It will fail in some cases, but that's a missed optimization, not a miss-optimization.
> 
> Yeah, like you say, I disagree that it's conservatively correct.
> It means that we're hoping (without proving) that the only things
> between stmt_info and the final pattern statement are conversions.
> I don't think there's any reason in principle why that must hold.
> 
> What would be conservatively correct would be to start from the
> final pattern statement and work our way down to the value that
> is actually being used.  That seems a bit convoluted though,
> so I'd prefer not to do that...
> 
>> In any case, if you disagree I don’t' really see a way forward aside from making this its own pattern
>> running it before the overwidening pattern.
> 
> I think we should look to see if ranger can be persuaded to provide the
> range of the 16-bit addition, even though the statement that produces it
> isn't part of a BB.  It shouldn't matter that the addition originally
> came from a 32-bit one: the range follows directly from the ranges of
> the operands (i.e. the fact that the operands are the results of
> widening conversions).

I think you can ask ranger about operations on names defined in the IL, so you can work yourself through the sequence of operations in the pattern sequence to compute ranges on their defs (and possibly even store them in the SSA info).  You just need to pick the correct ranger API for this.  Andrew CCed

Richard 

> Thanks,
> Richard
Andrew MacLeod Feb. 10, 2023, 8:58 p.m. UTC | #17
On 2/10/23 13:34, Richard Biener wrote:
>
>>> In any case, if you disagree I don’t' really see a way forward aside from making this its own pattern
>>> running it before the overwidening pattern.
>> I think we should look to see if ranger can be persuaded to provide the
>> range of the 16-bit addition, even though the statement that produces it
>> isn't part of a BB.  It shouldn't matter that the addition originally
>> came from a 32-bit one: the range follows directly from the ranges of
>> the operands (i.e. the fact that the operands are the results of
>> widening conversions).
> I think you can ask ranger on operations on names defined in the IL, so you can work yourself through the sequence of operations in the pattern sequence to compute ranges on their defs (and possibly even store them in the SSA info).  You just need to pick the correct ranger API for this…. Andrew CCed
>
>
It's not clear to me what's being asked...

Expressions don't need to be in the IL to do range calculations.  I
believe we support arbitrary tree expressions via range_of_expr.

If you have 32-bit ranges that you want to do 16-bit addition on, you
can also cast those ranges to a 16-bit type:

my32bitrange.cast (my16bittype);

then invoke range-ops directly via getting the handler:

handler = range_op_handler (PLUS_EXPR, 16bittype_tree);
if (handler)
    handler->fold (result, my16bittype, mycasted32bitrange,
		   myothercasted32bitrange);

There are higher-level APIs if what you have on hand is closer to IL
than random ranges.

Describe exactly what it is you want to do... and I'll try to direct you
to the best way to do it.

Andrew
Tamar Christina Feb. 13, 2023, 9:54 a.m. UTC | #18
> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Friday, February 10, 2023 8:59 PM
> To: Richard Biener <rguenther@suse.de>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; Tamar Christina via Gcc-
> patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>;
> jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/10/23 13:34, Richard Biener wrote:
> >
> >>> In any case, if you disagree I don’t' really see a way forward aside
> >>> from making this its own pattern running it before the overwidening
> pattern.
> >> I think we should look to see if ranger can be persuaded to provide
> >> the range of the 16-bit addition, even though the statement that
> >> produces it isn't part of a BB.  It shouldn't matter that the
> >> addition originally came from a 32-bit one: the range follows
> >> directly from the ranges of the operands (i.e. the fact that the
> >> operands are the results of widening conversions).
> > I think you can ask ranger on operations on names defined in the IL,
> > so you can work yourself through the sequence of operations in the
> > pattern sequence to compute ranges on their defs (and possibly even
> > store them in the SSA info).  You just need to pick the correct ranger
> > API for this…. Andrew CCed
> >
> >
> Its not clear to me whats being asked...
> 
> Expressions don't need to be in the IL to do range calculations.. I believe we
> support arbitrary tree expressions via range_of_expr.
> 
> if you have 32 bit ranges that you want to do 16 bit addition on, you can also
> cast those ranges to a 16bit type,
> 
> my32bitrange.cast (my16bittype);
> 
> then invoke range-ops directly via getting the handler:
> 
> handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
>     handler->fold (result, my16bittype, mycasted32bitrange,
> myothercasted32bitrange)
> 
> There are higher level APIs if what you have on hand is closer to IL than
> random ranges
> 
> Describe exactly what it is you want to do... and I'll try to direct you to the
> best way to do it.

The vectorizer has a pattern matcher that runs at startup on the scalar code.
This pattern matcher can replace one or more statements with alternative ones;
these can be either existing tree codes or new internal functions.

One of the patterns here is an overwidening detection pattern, which reduces the
precision that an operation is to be done in during vectorization.

Another one is widening addition, which replaces PLUS_EXPR with WIDEN_PLUS_EXPR.

These can be chained, so e.g. a widening addition done on ints can be reduced to a
widening addition done on shorts.

The question is whether, given the new expression that the vectorizer has
created, ranger can tell what its range is.  get_range_query fails because presumably
it has no idea about the new operations created and also doesn't know about
any new IFNs.
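
(Roughly the query that fails -- a sketch, with pattern_stmt standing for the
new statement the vectorizer created:)

  int_range_max r;
  /* For a pattern stmt like "patt_27 = (_3) w+ (level_15(D))" this gives
     nothing useful: the stmt is not in a BB and the widening ops have no
     range-ops entries, so we just get VARYING.  */
  if (!get_range_query (cfun)->range_of_expr (r, oprnd0, pattern_stmt)
      || r.varying_p ())
    /* can't optimize */;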

Thanks,
Tamar

> 
> Andrew
> 
>
Tamar Christina Feb. 15, 2023, 12:51 p.m. UTC | #19
> > >>> In any case, if you disagree I don’t' really see a way forward
> > >>> aside from making this its own pattern running it before the
> > >>> overwidening
> > pattern.
> > >> I think we should look to see if ranger can be persuaded to provide
> > >> the range of the 16-bit addition, even though the statement that
> > >> produces it isn't part of a BB.  It shouldn't matter that the
> > >> addition originally came from a 32-bit one: the range follows
> > >> directly from the ranges of the operands (i.e. the fact that the
> > >> operands are the results of widening conversions).
> > > I think you can ask ranger on operations on names defined in the IL,
> > > so you can work yourself through the sequence of operations in the
> > > pattern sequence to compute ranges on their defs (and possibly even
> > > store them in the SSA info).  You just need to pick the correct
> > > ranger API for this…. Andrew CCed
> > >
> > >
> > Its not clear to me whats being asked...
> >
> > Expressions don't need to be in the IL to do range calculations.. I
> > believe we support arbitrary tree expressions via range_of_expr.
> >
> > if you have 32 bit ranges that you want to do 16 bit addition on, you
> > can also cast those ranges to a 16bit type,
> >
> > my32bitrange.cast (my16bittype);
> >
> > then invoke range-ops directly via getting the handler:
> >
> > handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
> >     handler->fold (result, my16bittype, mycasted32bitrange,
> > myothercasted32bitrange)
> >
> > There are higher level APIs if what you have on hand is closer to IL
> > than random ranges
> >
> > Describe exactly what it is you want to do... and I'll try to direct
> > you to the best way to do it.
> 
> The vectorizer has  a pattern matcher that runs at startup on the scalar code.
> This pattern matcher can replace one or more statements with alternative
> ones, these can be either existing tree_code or new internal functions.
> 
> One of the patterns here is a overwidening detection pattern which reduces
> the precision that an operation is to be done in during vectorization.
> 
> Another one is widening multiplication, which replaced PLUS_EXPR with
> WIDEN_PLUS_EXPR.
> 
> These can be chained, so e.g. a widening addition done on ints can be
> reduced to a widen addition done on shorts.
> 
> The question is whether given the new expression that the vectorizer has
> created whether ranger can tell what the precision is.  get_range_query fails
> because presumably it has no idea about the new operations created  and
> also doesn't know about any new IFNs.

Hi,

I have been trying to use ranger as requested. I've tried:

	  gimple_ranger ranger;
	  int_range_max r;
	  /* Check that no overflow will occur.  If we don't have range
	     information we can't perform the optimization.  */
	  if (ranger.range_of_expr (r, oprnd0, stmt))
	    {
	      wide_int max = r.upper_bound ();
                    ....

Which works for non-patterns, but still doesn't work for patterns.
On a stmt:
patt_27 = (_3) w+ (level_15(D));

it gives me a range:

$2 = {
  <wide_int_storage> = {
    val = {[0x0] = 0xffffffffffffffff, [0x1] = 0x7fff95bd8b00, [0x2] = 0x7fff95bd78b0, [0x3] = 0x3fa1dd0, [0x4] = 0x3fa1dd0, [0x5] = 0x344a706f832d4f00, [0x6] = 0x7fff95bd7950, [0x7] = 0x1ae7f11, [0x8] = 0x7fff95bd79f8},
    len = 0x1,
    precision = 0x10
  },
  members of generic_wide_int<wide_int_storage>:
  static is_sign_extended = 0x1
}

The precision is fine, but the range seems to be -1?

Should I use range_op_handler (WIDEN_PLUS_EXPR, ...) in this case?

Thanks,
Tamar

> 
> Thanks,
> Tamar
> 
> >
> > Andrew
> >
> >
Andrew MacLeod Feb. 15, 2023, 4:05 p.m. UTC | #20
On 2/15/23 07:51, Tamar Christina wrote:
>>>>>> In any case, if you disagree I don’t' really see a way forward
>>>>>> aside from making this its own pattern running it before the
>>>>>> overwidening
>>> pattern.
>>>>> I think we should look to see if ranger can be persuaded to provide
>>>>> the range of the 16-bit addition, even though the statement that
>>>>> produces it isn't part of a BB.  It shouldn't matter that the
>>>>> addition originally came from a 32-bit one: the range follows
>>>>> directly from the ranges of the operands (i.e. the fact that the
>>>>> operands are the results of widening conversions).
>>>> I think you can ask ranger on operations on names defined in the IL,
>>>> so you can work yourself through the sequence of operations in the
>>>> pattern sequence to compute ranges on their defs (and possibly even
>>>> store them in the SSA info).  You just need to pick the correct
>>>> ranger API for this…. Andrew CCed
>>>>
>>>>
>>> Its not clear to me whats being asked...
>>>
>>> Expressions don't need to be in the IL to do range calculations.. I
>>> believe we support arbitrary tree expressions via range_of_expr.
>>>
>>> if you have 32 bit ranges that you want to do 16 bit addition on, you
>>> can also cast those ranges to a 16bit type,
>>>
>>> my32bitrange.cast (my16bittype);
>>>
>>> then invoke range-ops directly via getting the handler:
>>>
>>> handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
>>>      handler->fold (result, my16bittype, mycasted32bitrange,
>>> myothercasted32bitrange)
>>>
>>> There are higher level APIs if what you have on hand is closer to IL
>>> than random ranges
>>>
>>> Describe exactly what it is you want to do... and I'll try to direct
>>> you to the best way to do it.
>> The vectorizer has  a pattern matcher that runs at startup on the scalar code.
>> This pattern matcher can replace one or more statements with alternative
>> ones, these can be either existing tree_code or new internal functions.
>>
>> One of the patterns here is a overwidening detection pattern which reduces
>> the precision that an operation is to be done in during vectorization.
>>
>> Another one is widening multiplication, which replaced PLUS_EXPR with
>> WIDEN_PLUS_EXPR.
>>
>> These can be chained, so e.g. a widening addition done on ints can be
>> reduced to a widen addition done on shorts.
>>
>> The question is whether given the new expression that the vectorizer has
>> created whether ranger can tell what the precision is.  get_range_query fails
>> because presumably it has no idea about the new operations created  and
>> also doesn't know about any new IFNs.
> Hi,
>
> I have been trying to use ranger as requested. I've tried:
>
> 	  gimple_ranger ranger;
> 	  int_range_max r;
> 	  /* Check that no overflow will occur.  If we don't have range
> 	     information we can't perform the optimization.  */
> 	  if (ranger.range_of_expr (r, oprnd0, stmt))
> 	    {
> 	      wide_int max = r.upper_bound ();
>                      ....
>
> Which works for non-patterns, but still doesn't work for patterns.
> On a stmt:
> patt_27 = (_3) w+ (level_15(D));
>
> it gives me a range:
>
> $2 = {
>    <wide_int_storage> = {
>      val = {[0x0] = 0xffffffffffffffff, [0x1] = 0x7fff95bd8b00, [0x2] = 0x7fff95bd78b0, [0x3] = 0x3fa1dd0, [0x4] = 0x3fa1dd0, [0x5] = 0x344a706f832d4f00, [0x6] = 0x7fff95bd7950, [0x7] = 0x1ae7f11, [0x8] = 0x7fff95bd79f8},
>      len = 0x1,
>      precision = 0x10
>    },
>    members of generic_wide_int<wide_int_storage>:
>    static is_sign_extended = 0x1
> }
>
> The precision is fine, but range seems to be -1?
>
> Should I use range_op_handler (WIDEN_PLUS_EXPR, ...) in this case?

It's easier to see the range if you dump it, i.e.:

p r.dump (stderr)

I'm way behind the curve on exactly what's going on.  I'm not sure how the
above 2 things relate.  I presume $2 is 'max'?  I have no context:
what did you expect the range of _3 to be?

We have no entry in range-ops.cc for WIDEN_PLUS_EXPR, so ranger would
no doubt only give back VARYING for that.  However, I doubt it would be
too difficult to write the fold_range() method for it.

It's unclear to me what you mean by it not working on patterns, so let's
do some basics.

You have a stmt  "patt_27 = (_3) w+ (level_15(D));"

I gather that's a WIDEN_PLUS_EXPR, and if I read it right, patt_27 is a
type that is twice as wide as _3, and will contain the value "_3 +
level_15"?

Your query above is asking for the range of _3 at this stmt in the IL.

And you are trying to determine whether the expression "_3 + level_15"
would still fit in the type of _3, and thus you could avoid the WIDEN_*
paradigm and revert to a simple plus?

And you also want to be able to do this for expressions which are not
currently in the IL?

----  IF that is all true, then I would suggest one of 2 possible routes.
1) We add WIDEN_PLUS_EXPR to range-ops.  This involves writing
fold_range() for it, whereby it would create a range of a type double the
precision of _3, then take the 2 ranges for op1 and op2, cast them to
this new type and add them.

2) Manually doing the same thing.  But if you are going to manually do
it, we might as well put that same code into fold_range; then the entire
ecosystem will benefit.
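
(A rough sketch of what 1) could look like -- the class name and signature
here are assumed for illustration, not actual range-ops code:)

  // Hypothetical range-ops entry for WIDEN_PLUS_EXPR: widen both operand
  // ranges to the double-precision result type, then fold as an ordinary
  // PLUS_EXPR in that type.
  bool
  operator_widen_plus::fold_range (irange &r, tree type,
				   const irange &op1, const irange &op2) const
  {
    int_range_max w1 = op1, w2 = op2;
    w1.cast (type);	// TYPE has twice the precision of the operands.
    w2.cast (type);
    range_op_handler handler (PLUS_EXPR, type);
    return handler && handler->fold_range (r, w1, w2);
  }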

Once the operation can be performed in range-ops, you can cast the new
range back to the type of _3 and see if it is fully represented, i.e.:

int_range_max r1, r2;
if (ranger.range_of_stmt (r1, stmt))
   {
     r2 = r1;
     r2.cast (TREE_TYPE (_3));
     r2.cast (TREE_TYPE (patt_27));
     if (r1 == r2)
       // No info was lost casting back and forth, so r1 must fit into
       // the type of _3.

That should work for within the IL.  And if you want to do the same
thing outside of the IL, you have to come up with the values you want to
use for op1 and op2, and replace the ranger query with a direct range-op fold:

range_op_handler handler (WIDEN_PLUS_EXPR, TREE_TYPE (patt_27));
if (handler && handler->fold_range (r1, range_of__3, range_of_level_15))
   {
     // same casting song and dance


If you don't want to go through this process, in theory, you could try
simply adding _3 and level_15 in their own precision, and if max/min
aren't +INF/-INF then you can probably assume there is no overflow?
In which case, the path you are on above for within a stmt
should work:

	  gimple_ranger ranger;
	  int_range_max r0, r1, def;
	  /* Check that no overflow will occur.  If we don't have range
	     information we can't perform the optimization.  */
	  if (ranger.range_of_expr (r0, oprnd0, stmt)
	      && ranger.range_of_expr (r1, oprnd1, stmt))
	    {
	      range_op_handler handler (PLUS_EXPR, TREE_TYPE (_3));
	      if (handler && handler->fold_range (def, r0, r1))
		// examine def.upper_bound() and def.lower_bound()
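		{
		  /* Hypothetical continuation of the sketch, names as
		     above: e.g. check that the sum still fits in the
		     narrow type.  */
		  wide_int lim = wi::max_value (TYPE_PRECISION (TREE_TYPE (_3)),
						UNSIGNED);
		  if (wi::leu_p (def.upper_bound (), lim))
		    /* no overflow possible in the narrow type */;
		}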

Am I grasping some of the issue here?

Andrew
Tamar Christina Feb. 15, 2023, 5:13 p.m. UTC | #21
> On 2/15/23 07:51, Tamar Christina wrote:
> >>>>>> In any case, if you disagree I don’t' really see a way forward
> >>>>>> aside from making this its own pattern running it before the
> >>>>>> overwidening
> >>> pattern.
> >>>>> I think we should look to see if ranger can be persuaded to
> >>>>> provide the range of the 16-bit addition, even though the
> >>>>> statement that produces it isn't part of a BB.  It shouldn't
> >>>>> matter that the addition originally came from a 32-bit one: the
> >>>>> range follows directly from the ranges of the operands (i.e. the
> >>>>> fact that the operands are the results of widening conversions).
> >>>> I think you can ask ranger on operations on names defined in the
> >>>> IL, so you can work yourself through the sequence of operations in
> >>>> the pattern sequence to compute ranges on their defs (and possibly
> >>>> even store them in the SSA info).  You just need to pick the
> >>>> correct ranger API for this…. Andrew CCed
> >>>>
> >>>>
> >>> Its not clear to me whats being asked...
> >>>
> >>> Expressions don't need to be in the IL to do range calculations.. I
> >>> believe we support arbitrary tree expressions via range_of_expr.
> >>>
> >>> if you have 32 bit ranges that you want to do 16 bit addition on,
> >>> you can also cast those ranges to a 16bit type,
> >>>
> >>> my32bitrange.cast (my16bittype);
> >>>
> >>> then invoke range-ops directly via getting the handler:
> >>>
> >>> handler = range_op_handler (PLUS_EXPR, 16bittype_tree); if (handler)
> >>>      handler->fold (result, my16bittype, mycasted32bitrange,
> >>> myothercasted32bitrange)
> >>>
> >>> There are higher level APIs if what you have on hand is closer to IL
> >>> than random ranges
> >>>
> >>> Describe exactly what it is you want to do... and I'll try to direct
> >>> you to the best way to do it.
> >> The vectorizer has  a pattern matcher that runs at startup on the scalar
> code.
> >> This pattern matcher can replace one or more statements with
> >> alternative ones, these can be either existing tree_code or new internal
> functions.
> >>
> >> One of the patterns here is a overwidening detection pattern which
> >> reduces the precision that an operation is to be done in during
> vectorization.
> >>
> >> Another one is widening multiplication, which replaced PLUS_EXPR with
> >> WIDEN_PLUS_EXPR.
> >>
> >> These can be chained, so e.g. a widening addition done on ints can be
> >> reduced to a widen addition done on shorts.
> >>
> >> The question is whether given the new expression that the vectorizer
> >> has created whether ranger can tell what the precision is.
> >> get_range_query fails because presumably it has no idea about the new
> >> operations created  and also doesn't know about any new IFNs.
> > Hi,
> >
> > I have been trying to use ranger as requested. I've tried:
> >
> > 	  gimple_ranger ranger;
> > 	  int_range_max r;
> > 	  /* Check that no overflow will occur.  If we don't have range
> > 	     information we can't perform the optimization.  */
> > 	  if (ranger.range_of_expr (r, oprnd0, stmt))
> > 	    {
> > 	      wide_int max = r.upper_bound ();
> >                      ....
> >
> > Which works for non-patterns, but still doesn't work for patterns.
> > On a stmt:
> > patt_27 = (_3) w+ (level_15(D));
> >
> > it gives me a range:
> >
> > $2 = {
> >    <wide_int_storage> = {
> >      val = {[0x0] = 0xffffffffffffffff, [0x1] = 0x7fff95bd8b00, [0x2] =
> 0x7fff95bd78b0, [0x3] = 0x3fa1dd0, [0x4] = 0x3fa1dd0, [0x5] =
> 0x344a706f832d4f00, [0x6] = 0x7fff95bd7950, [0x7] = 0x1ae7f11, [0x8] =
> 0x7fff95bd79f8},
> >      len = 0x1,
> >      precision = 0x10
> >    },
> >    members of generic_wide_int<wide_int_storage>:
> >    static is_sign_extended = 0x1
> > }
> >
> > The precision is fine, but range seems to be -1?
> >
> > Should I use range_op_handler (WIDEN_PLUS_EXPR, ...) in this case?
> 
> Its easier to see the range if you dump it.. ie:
> 
> p r.dump(stderr)
> 
> Im way behind the curve on exactly whats going on.  Im not sure how the
> above 2 things relate..  I presume $2 is is 'max'?  I have no context, what did
> you expect the range of _3 to be?

Yes, $2 is max, and the expected range is 0x1fe as it's unsigned addition.
I'll expand below.

> 
> We have no entry in range-ops.cc for a WIDEN_PLUS_EXPR,  so ranger would
> only give back a VARYING for that no doubt.. however I doubt it would be
> too difficult to write the fold_range() method for it.
> 
> Its unclear to me what you mean by it doesnt work on patterns. so lets do
> some basics.
> 
> You have a stmt  "patt_27 = (_3) w+ (level_15(D));"
> 
> I gather thats a WIDEN_PLUS_EXPR, and if I read it right, patt_27 is a type
> that is twice as wide as _3, and will contain the value "_3 + level_15"?
> 
> You query above is asking for the range of _3 at this stmt in the IL.
> 
> And you are trying to determine whether the expression "_3 + level_15"
> would still fit in the type of _3, and thus you could avoid the WIDEN_*
> paradigm and revert to a simply plus?
> 
> And you also want to be able to do this for expressions which are not
> currently in the IL?

A pattern is an alternative IL that the vectorizer introduces for the scalar
statement(s) it would vectorize.  The scalar statement(s) are not replaced
in the original scalar IL, but they are replaced in the IL the vectorizer uses.

The example I'm working with here is this

#define N 16
void fun2(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < n; i+=1)
    pixel[i] = (pixel[i] + level) / 0xff;
}

Where the C promotion rules promote the operands to int.  However, when
vectorizing we try to increase the VF of the loop, that is, do the operation in the
smallest type possible.  In this case it is safe to do the operation as a short.

So the vectorizer demotes 

  _28 = (int) level_14(D);
  _4 = (int) _3;
  _6 = _4 + _28;
  _7 = _6 / 255;
  _8 = (unsigned char) _7;

Into

  _28 = (short) level_14(D);
  _4 = (short) _3;
  _6 = _4 + _28;
  _7 = _6 / 255;
  _8 = (unsigned char) _7;

This is done in the scalar pattern matcher the vectorizer has, based on range
information.  The new instructions replace the old ones in the vectorizer IL.

There is then a second pattern matcher that runs because some targets have
operations that can perform the widening as part of the arithmetic operation.
There are many such operations: +w, *w, >>w, <<w, etc.  Some are tree codes,
others are represented as internal function calls.

This second pattern rewrites the above into:

  _6 = _3 +w level_14(D);
  _7 = _6 / 255;
  _8 = (unsigned char) _7;

Thus removing the need to promote before the addition.  What I'm working on
is an optimization for division, so I am after what the range of _6 is.  oprnd0 in my
example is the first operand of the division.

I need to know the range of _6 because, based on the range, we can optimize this
division into something much more efficient.
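
(As a standalone sanity check of the identity the patch relies on -- this is
just illustrative C, not vectorizer code: for the range [0, 0x1fe] that the
pixel example can produce, the division by 0xff can be done with two
additions and shifts without any intermediate overflowing 16 bits:)

  #include <stdint.h>
  #include <assert.h>

  int main (void)
  {
    /* x / 0xff == (x + ((x + 257) >> 8)) >> 8 over the whole range the
       demoted addition can produce.  */
    for (uint32_t x = 0; x <= 0x1fe; x++)
      {
	uint16_t t = (uint16_t) x + 257;	/* at most 0x2ff */
	uint16_t s = (uint16_t) x + (t >> 8);	/* at most 0x200 */
	assert ((uint16_t) (s >> 8) == x / 0xff);
      }
    return 0;
  }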

> 
> ----  IF that is all true, then I would suggest one of 2 possible routes.
> 1) we add WIDEN_PLUS_EXPR to range-ops.  THIs involves writing
> fold_range() for it whereby it would create a range of a type double the
> precision of _3, then take the 2 ranges for op1 and op2, cast them to this new
> type and add them.
> 

Right, so I guess none of the widening operations are currently there.  Can you
point me in the right direction as to where I need to add them?

> 2) manually doing the same thing.   BUt if you are goignto manually do it, we
> might as well put that same code into fold_range then the entire ecosystem
> will benefit.
> 
> Once the operation can be performed in range ops, you can cast the new
> range back to the type of _3 and see if its fully represented. ie
> 
> int_range_max r1, r2
> if (ranger.range_of_stmt (r1, stmt))
>    {
>      r2 = r1;
>      r2.cast (TREE_TYPE (_3));
>      r2.cast (TREE_TYPE (patt_27));
>      if (r1 == r2)
>        // No info was lost casting back and forth, so r1 must fit into type of _3
> 
> That should work for within the IL.  And if you want to do the same thing
> outside of the IL, you have to come up with the values you want to use for
> op1 and op2, replace the ranger query with a direct range-opfold:
> 
> range_op_handler handler (WIDEN_PLUS_EXPR, TREE_TYPE (patt_27));
> if (handler && handler->fold_range (r1, range_of__3, range_of_level_15))
>    {
>      // same casting song and dance
> 
> 

Just for my own understanding, does the fold_range here update the information
in the IL? Or is it just for this computation? So when I hit this pattern again it
recomputes it?

> If you don't want to go thru this process, in theory, you could try simply
> adding _3 and level_15 in their own precision, and if max/min aren't
> +INF/-INF then you can probably assume there is no overflow?
> In which case, the path you are on above for within a stmt should
> work:
> 
> 	  gimple_ranger ranger;
> 	  int_range_max r0, r1, def;
> 	  /* Check that no overflow will occur.  If we don't have range
> 	     information we can't perform the optimization.  */
> 	  if (ranger.range_of_expr (r0, oprnd0, stmt)
> 	      && ranger.range_of_expr (r1, oprnd1, stmt))
> 	    {
> 	      range_op_handler handler (PLUS_EXPR, TREE_TYPE (_3));
> 	      if (handler && handler->fold_range (def, r0, r1))
> 		// examine def.upper_bound() and def.lower_bound()
> 
> Am I grasping some of the issue here?

You are, and this was helpful.  I would imagine that Richard wouldn't accept
me doing it locally though.  So I guess if it's safe to do for this PR fix, I
can add the basic widening operations to range-ops if you can show me where.

Thanks,
Tamar

> 
> Andrew
> 
>
Andrew MacLeod Feb. 15, 2023, 5:50 p.m. UTC | #22
On 2/15/23 12:13, Tamar Christina wrote:
>> On 2/15/23 07:51, Tamar Christina wrote:
>>
Thanks, lots of useful context there.


> This second pattern replaces the above into:
>
>    _6 = _3 +w level_14(D);
>    _7 = _6 / 255;
>    _8 = (unsigned char) _7;
>
> Thus removing the need to promote before the addition.  What I'm working on
> is an optimization for division, so I am after the range of _6.  oprnd0 in my
> example is the first operand of the division.
>
> I need to know the range of _6 because based on that range we can optimize
> this division into something much more efficient.
>
>> ----  IF that is all true, then I would suggest one of 2 possible routes.
>> 1) we add WIDEN_PLUS_EXPR to range-ops.  This involves writing
>> fold_range() for it whereby it would create a range of a type double the
>> precision of _3, then take the 2 ranges for op1 and op2, cast them to this new
>> type and add them.
>>
> Right, so I guess none of the widening operations are currently there.  Can you
> point me in the right direction of where I need to add them?

sure, details below


>> 2) manually doing the same thing.   But if you are going to manually do it, we
>> might as well put that same code into fold_range so the entire ecosystem
>> will benefit.
>>
>> Once the operation can be performed in range ops, you can cast the new
>> range back to the type of _3 and see if it's fully represented, i.e.
>>
>> int_range_max r1, r2;
>> if (ranger.range_of_stmt (r1, stmt))
>>     {
>>       r2 = r1;
>>       r2.cast (TREE_TYPE (_3));
>>       r2.cast (TREE_TYPE (patt_27));
>>       if (r1 == r2)
>>         // No info was lost casting back and forth, so r1 must fit into type of _3
>>
>> That should work for within the IL.  And if you want to do the same thing
>> outside of the IL, you have to come up with the values you want to use for
>> op1 and op2, replace the ranger query with a direct range-opfold:
>>
>> range_op_handler handler (WIDEN_PLUS_EXPR, TREE_TYPE (patt_27));
>> if (handler && handler->fold_range (r1, range_of__3, range_of_level_15))
>>     {
>>       // same casting song and dance
>>
>>
> Just for my own understanding, does the fold_range here update the information
> in the IL? Or is it just for this computation? So when I hit this pattern again it
> recomputes it?

fold_range does not update anything.  It just performs the calculation, 
and passes like VRP etc. are responsible for if, and when, that is 
reflected in some way/transformation in the IL.  The IL is primarily used 
for context to look back and try to determine the range of the inputs to 
the statement.   That's why, if you aren't using an expression in the IL, 
you need to provide the ranges yourself.   By default, you end up with 
the full range for the type, i.e. VARYING, but if ranger can determine 
through branches and such that it's something different, it will.  So 
if your case is preceded by

if (_3 < 20 && level_15 < 20)
   // the range of _3 will be [0, 19] and _15 will be [0, 19], and the
   // addition will end up with a range of [0, 38]

In your case, I see the ranges are the range of the 8-bit type: [irange]
int [0, 255] NONZERO 0xff



>> If you don't want to go thru this process, in theory, you could try simply
>> adding _3 and level_15 in their own precision, and if max/min aren't
>> +INF/-INF then you can probably assume there is no overflow?
>> In which case, the path you are on above for within a stmt should
>> work:
>>
>> 	  gimple_ranger ranger;
>> 	  int_range_max r0, r1, def;
>> 	  /* Check that no overflow will occur.  If we don't have range
>> 	     information we can't perform the optimization.  */
>> 	  if (ranger.range_of_expr (r0, oprnd0, stmt)
>> 	      && ranger.range_of_expr (r1, oprnd1, stmt))
>> 	    {
>> 	      range_op_handler handler (PLUS_EXPR, TREE_TYPE (_3));
>> 	      if (handler && handler->fold_range (def, r0, r1))
>> 		// examine def.upper_bound() and def.lower_bound()
>>
>> Am I grasping some of the issue here?
> You are, and this was helpful.  I would imagine that Richard wouldn't accept me
> doing it locally though.  So I guess if it's safe to do for this PR fix, I can add the basic
> widening operations to range-ops if you can show me where.
>

all the range-op integer code is in gcc/range-op.cc.  As this is a basic 
binary operation, you should be able to get away with implementing a 
single routine, wi_fold (), which adds 2 wide int bounds together and 
returns a result.  This is the implementation for operator_plus.

void
operator_plus::wi_fold (irange &r, tree type,
                         const wide_int &lh_lb, const wide_int &lh_ub,
                         const wide_int &rh_lb, const wide_int &rh_ub) const
{
   wi::overflow_type ov_lb, ov_ub;
   signop s = TYPE_SIGN (type);
   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
   value_range_with_overflow (r, type, new_lb, new_ub, ov_lb, ov_ub);
}


you shouldn't have to do any of the overflow stuff at the end, just take 
the 2 sets of wide ints, double their precision to start, add them 
together (it can't possibly overflow, right?) and then return an 
int_range<2> with those bounds...
i.e.

void
operator_plus::wi_fold (irange &r, tree type,
                         const wide_int &lh_lb, const wide_int &lh_ub,
                         const wide_int &rh_lb, const wide_int &rh_ub) const
{
   wi::overflow_type ov_lb, ov_ub;
   signop s = TYPE_SIGN (type);

   // Do whatever wide_int magic is required to do these adds in higher
   // precision
   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);

   r = int_range<2> (type, new_lb, new_ub);
}


The operator needs to be registered; I've attached the skeleton for it.
You should just have to finish implementing wi_fold().

in theory :-)
Andrew MacLeod Feb. 15, 2023, 6:42 p.m. UTC | #23
On 2/15/23 12:50, Andrew MacLeod wrote:
>
> On 2/15/23 12:13, Tamar Christina wrote:
>>> On 2/15/23 07:51, Tamar Christina wrote:
> void
> operator_plus::wi_fold (irange &r, tree type,
>                         const wide_int &lh_lb, const wide_int &lh_ub,
>                         const wide_int &rh_lb, const wide_int &rh_ub) 
> const
> {
>   wi::overflow_type ov_lb, ov_ub;
>   signop s = TYPE_SIGN (type);
>
>   // Do whatever wideint magic is required to do this adds in higher 
> precision
>   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>
>   r = int_range<2> (type, new_lb, new_ub);
> }
>
>
> The operator needs to be registered, I've attached the skeleton for 
> it.  you should just have to finish implementing wi_fold().
>
> in theory :-)
>
You also mentioned earlier that some were tree codes, some were internal 
function calls?  We have some initial support for built-in functions, 
but I am not familiar with all the various forms they can take.  We 
currently support CFN_ functions in

   gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()

Basically this is part of a "gimple_range_op_handler" wrapper for 
range-ops which can provide a range-ops class for stmts that don't map 
to a binary or unary form, such as built-in functions.

If you get to the point where you need this for a builtin function, I 
can help you through that too.  Although someone may have to also help 
me through what differentiates the different kinds of internal function 
:-)    I presume they are all similar in some way.

Andrew
Tamar Christina Feb. 22, 2023, 12:51 p.m. UTC | #24
> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Wednesday, February 15, 2023 6:43 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/15/23 12:50, Andrew MacLeod wrote:
> >
> > On 2/15/23 12:13, Tamar Christina wrote:
> >>> On 2/15/23 07:51, Tamar Christina wrote:
> > void
> > operator_plus::wi_fold (irange &r, tree type,
> >                         const wide_int &lh_lb, const wide_int &lh_ub,
> >                         const wide_int &rh_lb, const wide_int &rh_ub)
> > const {
> >   wi::overflow_type ov_lb, ov_ub;
> >   signop s = TYPE_SIGN (type);
> >
> >   // Do whatever wideint magic is required to do this adds in higher
> > precision
> >   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
> >   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
> >
> >   r = int_range<2> (type, new_lb, new_ub); }
> >
> >
> > The operator needs to be registered, I've attached the skeleton for
> > it.  you should just have to finish implementing wi_fold().
> >
> > in theory :-)
> >
> You also mentioned earlier that some were tree codes, some were internal
> function calls?  We have some initial support for built in functions,
> but I am not familiar with all the various forms they can take.  We
> currently support CFN_ functions in

Ah, then this should work.  CFNs are a helper class to combine compiler builtins
and internal functions in one structure.  So with support for CFN_ both should
be supported.  Probably just a matter of adding the new ops then.

> 
>    gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
> 
> Basically this is part of a "gimple_range_op_handler"  wrapper for
> range-ops which can provide a range-ops class for stmts that don't map
> to a binary or unary form.. such as built in functions.
> 
> If you get to the point where you need this for a builtin function, I
> can help you through that too.  Although someone may have to also help
> me through what differentiates the different kinds of internal function
> :-)    I presume they are all similar in some way.

Will do! I'm hoping to fix some range-related vectorizer missed optimizations with
this in GCC 14, so I'll be back 😊

Cheers,
Tamar
> 
> Andrew
>
Tamar Christina Feb. 22, 2023, 1:06 p.m. UTC | #25
Hi Andrew,

> 
> all the range-op integer code is in gcc/range-op.cc.  As this is a basic
> binary operation, you should be able to get away with implementing a
> single routine,  wi_fold () which adds 2 wide int bounds  together and
> returns a result.  THis si the implelemntaion for operator_plus.
> 
> void
> operator_plus::wi_fold (irange &r, tree type,
>                          const wide_int &lh_lb, const wide_int &lh_ub,
>                          const wide_int &rh_lb, const wide_int &rh_ub) const
> {
>    wi::overflow_type ov_lb, ov_ub;
>    signop s = TYPE_SIGN (type);
>    wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>    wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>    value_range_with_overflow (r, type, new_lb, new_ub, ov_lb, ov_ub);
> }
> 
> 
> you shouldn't have to do any of the overflow stuff at the end, just take
> the 2 sets of wide int, double their precision to start, add them
> together (it cant possible overflow right) and then return an
> int_range<2> with those bounds...
> ie
> 
> void
> operator_plus::wi_fold (irange &r, tree type,
>                          const wide_int &lh_lb, const wide_int &lh_ub,
>                          const wide_int &rh_lb, const wide_int &rh_ub) const
> {
>    wi::overflow_type ov_lb, ov_ub;
>    signop s = TYPE_SIGN (type);
> 
>    // Do whatever wideint magic is required to do this adds in higher
> precision
>    wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>    wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
> 
>    r = int_range<2> (type, new_lb, new_ub);
> }
> 

So I've been working on adding support for widening plus and widening multiplication,
and my examples all work now, but during bootstrap I hit a problem.

Say you have a mixed sign widening multiplication, such as in:

int decMultiplyOp_zacc, decMultiplyOp_iacc;
int *decMultiplyOp_lp;
void decMultiplyOp() {
  decMultiplyOp_lp = &decMultiplyOp_zacc;
  for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
       decMultiplyOp_lp++)
    *decMultiplyOp_lp = 0;
}

Eventually the pointer arithmetic will generate:

intD.7 decMultiplyOp_iacc.2_13;
long unsigned intD.11 _15;
_15 = decMultiplyOp_iacc.2_13 w* 4;
and it'll try to get the range from this.

My implementation is just:

void
operator_widen_mult::wi_fold (irange &r, tree type,
			const wide_int &lh_lb, const wide_int &lh_ub,
			const wide_int &rh_lb, const wide_int &rh_ub) const
{
  signop s = TYPE_SIGN (type);

  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);

  /* We don't expect a widening multiplication to be able to overflow, but range
     calculations for multiplications are complicated.  After widening the
     operands, call the base class.  */
  return operator_mult::wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
}

But in this case the operands are different types and the wi_fold only gets the
type of the operation. The issue is that when increasing the precision for lh_*
I need to sign extend the value and not zero extend, but I don't seem to have
enough context here to know that I do.  I'm missing the type of the operands.

For non-widening operations this doesn't matter as the precision stays the same.

Is there a way to get the information I need?

Thanks,
Tamar

> 
> The operator needs to be registered, I've attached the skeleton for it.
> you should just have to finish implementing wi_fold().
> 
> in theory :-)
Andrew MacLeod Feb. 22, 2023, 3:19 p.m. UTC | #26
On 2/22/23 08:06, Tamar Christina wrote:
> Hi Andrew,
>
>> all the range-op integer code is in gcc/range-op.cc.  As this is a basic
>> binary operation, you should be able to get away with implementing a
>> single routine,  wi_fold () which adds 2 wide int bounds  together and
>> returns a result.  THis si the implelemntaion for operator_plus.
>>
>> void
>> operator_plus::wi_fold (irange &r, tree type,
>>                           const wide_int &lh_lb, const wide_int &lh_ub,
>>                           const wide_int &rh_lb, const wide_int &rh_ub) const
>> {
>>     wi::overflow_type ov_lb, ov_ub;
>>     signop s = TYPE_SIGN (type);
>>     wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>     wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>     value_range_with_overflow (r, type, new_lb, new_ub, ov_lb, ov_ub);
>> }
>>
>>
>> you shouldn't have to do any of the overflow stuff at the end, just take
>> the 2 sets of wide int, double their precision to start, add them
>> together (it cant possible overflow right) and then return an
>> int_range<2> with those bounds...
>> ie
>>
>> void
>> operator_plus::wi_fold (irange &r, tree type,
>>                           const wide_int &lh_lb, const wide_int &lh_ub,
>>                           const wide_int &rh_lb, const wide_int &rh_ub) const
>> {
>>     wi::overflow_type ov_lb, ov_ub;
>>     signop s = TYPE_SIGN (type);
>>
>>     // Do whatever wideint magic is required to do this adds in higher
>> precision
>>     wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>     wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>
>>     r = int_range<2> (type, new_lb, new_ub);
>> }
>>
> So I've been working on adding support for widening plus and widening multiplication,
> and my examples all work now.. but during bootstrap I hit a problem.
>
> Say you have a mixed sign widening multiplication, such as in:
>
> int decMultiplyOp_zacc, decMultiplyOp_iacc;
> int *decMultiplyOp_lp;
> void decMultiplyOp() {
>    decMultiplyOp_lp = &decMultiplyOp_zacc;
>    for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
>         decMultiplyOp_lp++)
>      *decMultiplyOp_lp = 0;
> }
>
> Eventually the pointer arithmetic will generate:
>
> intD.7 decMultiplyOp_iacc.2_13;
> long unsigned intD.11 _15;
> _15 = decMultiplyOp_iacc.2_13 w* 4;
> and it'll try to get the range from this.
>
> My implementation is just:
>
> void
> operator_widen_mult::wi_fold (irange &r, tree type,
> 			const wide_int &lh_lb, const wide_int &lh_ub,
> 			const wide_int &rh_lb, const wide_int &rh_ub) const
> {
>    signop s = TYPE_SIGN (type);
>
>    wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
>    wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
>    wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
>    wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
>
> >    /* We don't expect a widening multiplication to be able to overflow, but range
> >       calculations for multiplications are complicated.  After widening the
> >       operands, call the base class.  */
>    return operator_mult::wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> }
>
> But in this case the operands are different types and the wi_fold only gets the
> type of the operation. The issue is that when increasing the precision for lh_*
> I need to sign extend the value and not zero extend, but I don't seem to have
> enough context here to know that I do.  I'm missing the type of the operands.
>
> For non-widening operations this doesn't matter as the precision stays the same.
>
> Is there a way to get the information I need?
>
>
we haven't had this situation before, if I understand it correctly:

The LHS is a different type than both the operands, and your problem is 
you need to know the sign of at least operand1 in order to know whether 
to zero extend or to sign extend it?  Huh, haven't run into that with 
any other bit of IL before :-P

Let me think about it.  I am loath to change range-ops itself, but we 
may be able to leverage the builtin-function approach to dealing with 
something non-standard, at least for the moment to keep you going.

For the builtins, we provide a range-ops handler *after* we look at the 
operands from within a gimple context where we can still see the types, 
and choose an appropriate handler.  So I'm thinking we provide 2 handlers,

operator_widen_mult_signed
operator_widen_mult_unsigned

chosen based on whether to sign extend or zero extend op1.  Look at the 
type of operand one, and return the appropriate handler.  Let me give you 
a skeleton.  I *think* this should do it.

You can provide 2 versions of operator_widen_mult in range-ops (so you 
can still inherit from operator_mult).  They should be exported, and the 
appropriate one should be called.

Give it a try and see if it works :-P.
Andrew MacLeod Feb. 22, 2023, 4:41 p.m. UTC | #27
On 2/15/23 13:42, Andrew MacLeod wrote:
>
> On 2/15/23 12:50, Andrew MacLeod wrote:
>>
>> On 2/15/23 12:13, Tamar Christina wrote:
>>>> On 2/15/23 07:51, Tamar Christina wrote:
>> void
>> operator_plus::wi_fold (irange &r, tree type,
>>                         const wide_int &lh_lb, const wide_int &lh_ub,
>>                         const wide_int &rh_lb, const wide_int &rh_ub) 
>> const
>> {
>>   wi::overflow_type ov_lb, ov_ub;
>>   signop s = TYPE_SIGN (type);
>>
>>   // Do whatever wideint magic is required to do this adds in higher 
>> precision
>>   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>
>>   r = int_range<2> (type, new_lb, new_ub);
>> }
>>
>>
>> The operator needs to be registered, I've attached the skeleton for 
>> it.  you should just have to finish implementing wi_fold().
>>
>> in theory :-)
>>
> You also mentioned earlier that some were tree codes, some were 
> internal function calls?  We have some initial support for built in 
> functions, but I am not familiar with all the various forms they can 
> take.  We currently support CFN_ functions in
>
>   gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
>
> Basically this is part of a "gimple_range_op_handler"  wrapper for 
> range-ops which can provide a range-ops class for stmts that don't map 
> to a binary or unary form.. such as built in functions.
>
> If you get to the point where you need this for a builtin function, I 
> can help you through that too.  Although someone may have to also help 
> me through what differentiates the different kinds of internal 
> function :-)    I presume they are all similar in some way.
>
> Andrew
>
Oh yeah, and in case you haven't figured it out on your own, you'll have 
to remove WIDEN_MULT_EXPR from the range-ops init table.   This 
non-standard mechanism only gets checked if there is no standard 
range-op table entry for the tree code :-P

Andrew
Tamar Christina Feb. 22, 2023, 6:03 p.m. UTC | #28
> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Wednesday, February 22, 2023 4:42 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/15/23 13:42, Andrew MacLeod wrote:
> >
> > On 2/15/23 12:50, Andrew MacLeod wrote:
> >>
> >> On 2/15/23 12:13, Tamar Christina wrote:
> >>>> On 2/15/23 07:51, Tamar Christina wrote:
> >> void
> >> operator_plus::wi_fold (irange &r, tree type,
> >>                         const wide_int &lh_lb, const wide_int &lh_ub,
> >>                         const wide_int &rh_lb, const wide_int &rh_ub)
> >> const {
> >>   wi::overflow_type ov_lb, ov_ub;
> >>   signop s = TYPE_SIGN (type);
> >>
> >>   // Do whatever wideint magic is required to do this adds in higher
> >> precision
> >>   wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
> >>   wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
> >>
> >>   r = int_range<2> (type, new_lb, new_ub); }
> >>
> >>
> >> The operator needs to be registered, I've attached the skeleton for
> >> it.  you should just have to finish implementing wi_fold().
> >>
> >> in theory :-)
> >>
> > You also mentioned earlier that some were tree codes, some were
> > internal function calls?  We have some initial support for built in
> > functions, but I am not familiar with all the various forms they can
> > take.  We currently support CFN_ functions in
> >
> >   gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
> >
> > Basically this is part of a "gimple_range_op_handler"  wrapper for
> > range-ops which can provide a range-ops class for stmts that don't map
> > to a binary or unary form.. such as built in functions.
> >
> > If you get to the point where you need this for a builtin function, I
> > can help you through that too.  Although someone may have to also help
> > me through what differentiates the different kinds of internal
> > function :-)    I presume they are all similar in some way.
> >
> > Andrew
> >
> Oh yeah, and in case you haven't figured it out on your own, you'll have
> to remove WIDEN_MULT_EXPR from the range-ops init table.   This
> non-standard mechanism only gets checked if there is no standard
> range-op table entry for the tree code :-P
> 

Hmm it looks like it'll work, but it keeps segfaulting in:

bool
range_op_handler::fold_range (vrange &r, tree type,
			      const vrange &lh,
			      const vrange &rh,
			      relation_trio rel) const
{
  gcc_checking_assert (m_valid);
  if (m_int)
    return m_int->fold_range (as_a <irange> (r), type,
			   as_a <irange> (lh),
			   as_a <irange> (rh), rel);

while trying to call fold_range.

But m_int is set to the right instance. Probably something I'm missing,
I'll double check it all.

Cheers,
Tamar
> Andrew
> 
> Andrew
Andrew MacLeod Feb. 22, 2023, 6:33 p.m. UTC | #29
On 2/22/23 13:03, Tamar Christina wrote:
>> -----Original Message-----
>> From: Andrew MacLeod <amacleod@redhat.com>
>> Sent: Wednesday, February 22, 2023 4:42 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
>> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>>
>>
>> On 2/15/23 13:42, Andrew MacLeod wrote:
>>> On 2/15/23 12:50, Andrew MacLeod wrote:
>>>> On 2/15/23 12:13, Tamar Christina wrote:
>>>>>> On 2/15/23 07:51, Tamar Christina wrote:
>>>> void
>>>> operator_plus::wi_fold (irange &r, tree type,
>>>>                          const wide_int &lh_lb, const wide_int &lh_ub,
>>>>                          const wide_int &rh_lb, const wide_int &rh_ub)
>>>> const {
>>>>    wi::overflow_type ov_lb, ov_ub;
>>>>    signop s = TYPE_SIGN (type);
>>>>
>>>>    // Do whatever wideint magic is required to do this adds in higher
>>>> precision
>>>>    wide_int new_lb = wi::add (lh_lb, rh_lb, s, &ov_lb);
>>>>    wide_int new_ub = wi::add (lh_ub, rh_ub, s, &ov_ub);
>>>>
>>>>    r = int_range<2> (type, new_lb, new_ub); }
>>>>
>>>>
>>>> The operator needs to be registered, I've attached the skeleton for
>>>> it.  you should just have to finish implementing wi_fold().
>>>>
>>>> in theory :-)
>>>>
>>> You also mentioned earlier that some were tree codes, some were
>>> internal function calls?  We have some initial support for built in
>>> functions, but I am not familiar with all the various forms they can
>>> take.  We currently support CFN_ functions in
>>>
>>>    gimple-range-op.cc, gimple_range_op_handler::maybe_builtin_call ()
>>>
>>> Basically this is part of a "gimple_range_op_handler"  wrapper for
>>> range-ops which can provide a range-ops class for stmts that don't map
>>> to a binary or unary form.. such as built in functions.
>>>
>>> If you get to the point where you need this for a builtin function, I
>>> can help you through that too.  Although someone may have to also help
>>> me through what differentiates the different kinds of internal
>>> function :-)    I presume they are all similar in some way.
>>>
>>> Andrew
>>>
>> Oh yeah, and in case you haven't figured it out on your own, you'll have
>> to remove WIDEN_MULT_EXPR from the range-ops init table.   This
>> non-standard mechanism only gets checked if there is no standard
>> range-op table entry for the tree code :-P
>>
> Hmm it looks like it'll work, but it keeps segfaulting in:
>
> bool
> range_op_handler::fold_range (vrange &r, tree type,
> 			      const vrange &lh,
> 			      const vrange &rh,
> 			      relation_trio rel) const
> {
>    gcc_checking_assert (m_valid);
>    if (m_int)
>      return m_int->fold_range (as_a <irange> (r), type,
> 			   as_a <irange> (lh),
> 			   as_a <irange> (rh), rel);
>
> while trying to call fold_range.
>
> But m_int is set to the right instance. Probably something I'm missing,
> I'll double check it all.
>
Hmm.  What does your class operator_widen_mult* look like?  What are you 
inheriting from?   Send me your patch and I'll have a look if you want.  
This is somewhat new territory :-)

I can't imagine it being a linkage thing between the 2 files, since the 
operator is defined in another file and the address is taken in this one; 
that should work, but strange that it can't make the call...

Andrew
Tamar Christina Feb. 23, 2023, 8:36 a.m. UTC | #30
Hi Andrew,

> >> Oh yeah, and in case you haven't figured it out on your own, you'll
> >> have to remove WIDEN_MULT_EXPR from the range-ops init table.   This
> >> non-standard mechanism only gets checked if there is no standard
> >> range-op table entry for the tree code :-P
> >>
> > Hmm it looks like it'll work, but it keeps segfaulting in:
> >
> > bool
> > range_op_handler::fold_range (vrange &r, tree type,
> > 			      const vrange &lh,
> > 			      const vrange &rh,
> > 			      relation_trio rel) const
> > {
> >    gcc_checking_assert (m_valid);
> >    if (m_int)
> >      return m_int->fold_range (as_a <irange> (r), type,
> > 			   as_a <irange> (lh),
> > 			   as_a <irange> (rh), rel);
> >
> > while trying to call fold_range.
> >
> > But m_int is set to the right instance. Probably something I'm
> > missing, I'll double check it all.
> >
> Hmm.  What does your class operator_widen_mult* look like?  What are you
> inheriting from?   Send me your patch and I'll have a look if you want.  This is
> somewhat new territory :-)

I've attached the patch, and my testcase is:

int decMultiplyOp_zacc, decMultiplyOp_iacc;
int *decMultiplyOp_lp;
void decMultiplyOp() {
  decMultiplyOp_lp = &decMultiplyOp_zacc;
  for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
       decMultiplyOp_lp++)
    *decMultiplyOp_lp = 0;
}

And compiling with aarch64-none-elf-gcc -O2 zero.c -S -o - -Werror=stringop-overflow

Also, to explain a bit about why we're only seeing this now:

The original sequence for most of the pipeline is based on a cast and multiplication

  # RANGE [irange] long unsigned int [0, 2147483647][18446744071562067968, +INF]
  _14 = (long unsigned intD.11) decMultiplyOp_iacc.2_13;
  # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
  _15 = _14 * 4;

But things like widening multiply are quite common, so some ISAs have it on scalars as well, not just vectors.
So there's a pass widening_mul that runs late for these targets.  This replaces the above with

  # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
  _15 = decMultiplyOp_iacc.2_13 w* 4;

And copies over the final range from the original expression.

After that there are passes, like the warning passes, that try to re-query ranges to see if any optimization has changed them.
Before my attempt to support *w this would just return VARYING and they would only use the old range.

Now however, without taking care to sign extend when appropriate, the MIN range changes from a negative value to a large
positive one when we increase the precision.  So passes that re-query late get the wrong range.  That's why, for instance, in this case
an incorrect warning is generated.

Thanks for the help!

Tamar

> 
> I cant imagine it being a linkage thing between the 2 files since the operator is
> defined in another file and the address taken in this one?
> that should work, but strange that cant make the call...
> 
> Andrew
Andrew MacLeod Feb. 23, 2023, 4:39 p.m. UTC | #31
On 2/23/23 03:36, Tamar Christina wrote:
> Hi Andrew,
>
>>>> Oh yeah, and in case you haven't figured it out on your own, you'll
>>>> have to remove WIDEN_MULT_EXPR from the range-ops init table.   This
>>>> non-standard mechanism only gets checked if there is no standard
>>>> range-op table entry for the tree code :-P
>>>>
>>> Hmm it looks like it'll work, but it keeps segfaulting in:
>>>
>>> bool
>>> range_op_handler::fold_range (vrange &r, tree type,
>>> 			      const vrange &lh,
>>> 			      const vrange &rh,
>>> 			      relation_trio rel) const
>>> {
>>>     gcc_checking_assert (m_valid);
>>>     if (m_int)
>>>       return m_int->fold_range (as_a <irange> (r), type,
>>> 			   as_a <irange> (lh),
>>> 			   as_a <irange> (rh), rel);
>>>
>>> while trying to call fold_range.
>>>
>>> But m_int is set to the right instance. Probably something I'm
>>> missing, I'll double check it all.
>>>
>> Hmm.  whats your class operator_widen_mult* look like? what are you
>> inheriting from?   Send me your patch and I'll have a look if you want. this is
>> somewhat  new territory :-)
> I've attached the patch, and my testcase is:
>
> int decMultiplyOp_zacc, decMultiplyOp_iacc;
> int *decMultiplyOp_lp;
> void decMultiplyOp() {
>    decMultiplyOp_lp = &decMultiplyOp_zacc;
>    for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
>         decMultiplyOp_lp++)
>      *decMultiplyOp_lp = 0;
> }
>
> And compiling with aarch64-none-elf-gcc -O2 zero.c -S -o - -Werror=stringop-overflow
>
> Also to explain a bit on why we're only seeing this now:
>
> The original sequence for most of the pipeline is based on a cast and multiplication
>
>    # RANGE [irange] long unsigned int [0, 2147483647][18446744071562067968, +INF]
>    _14 = (long unsigned intD.11) decMultiplyOp_iacc.2_13;
>    # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
>    _15 = _14 * 4;
>
> But things like widening multiply are quite common, so some ISAs have it on scalars as well, not just vectors.
> So there's a pass widening_mul that runs late for these targets.  This replaces the above with
>
>    # RANGE [irange] long unsigned int [0, 8589934588][18446744065119617024, 18446744073709551612] NONZERO 0xfffffffffffffffc
>    _15 = decMultiplyOp_iacc.2_13 w* 4;
>
> And copies over the final range from the original expression.
>
> After that there are passes, like the warning passes, that try to re-query ranges to see if any optimization has changed them.
> Before my attempt to support *w this would just return VARYING and they would only use the old range.
>
> Now however, without taking care to sign extend when appropriate the MIN range changes from a negative value to a large
> positive one when we increase the precision.  So passes that re-query late get the wrong range.  That's why for instance in this case
> we get an incorrect warning generated.
>
> Thanks for the help!
>
> Tamar
>
>> I cant imagine it being a linkage thing between the 2 files since the operator is
>> defined in another file and the address taken in this one?
>> that should work, but strange that cant make the call...
>>
>> Andrew

It is some sort of linkage/vtable thing.  The fix.diff patch applied on 
top of what you have will fix the fold issue.  This'll do for now until I 
formalize how this is going to work going forward.

Inheriting from operator_mult is also going to be hazardous because it 
also has an op1_range and op2_range...  you should at least define those 
and return VARYING to avoid other issues.  Same thing applies to 
widen_plus I think, and it has relation processing and other things as 
well.  Your widen operands are not what those classes expect, so I think 
you probably just want a fresh range operator.

It also looks like the mult operation is sign/zero extending both upper 
bounds, and neither lower bound...  I think that should be the LH upper 
and lower bound?

I've attached a second patch (newversion.patch) which incorporates my 
fix, the fix to the sign of only op1's bounds, as well as a 
simplification of the classes to not inherit from operator_mult/plus.
I think this still does what you want, and it won't get you into 
unexpected trouble later :-)

let me know if this is still doing what you are expecting...

Andrew
Tamar Christina Feb. 23, 2023, 4:56 p.m. UTC | #32
> -----Original Message-----
> From: Andrew MacLeod <amacleod@redhat.com>
> Sent: Thursday, February 23, 2023 4:40 PM
> To: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> 
> On 2/23/23 03:36, Tamar Christina wrote:
> > Hi Andrew,
> >
> >>>> Oh yeah, and in case you haven't figured it out on your own, you'll
> >>>> have to remove WIDEN_MULT_EXPR from the range-ops init table.
> >>>> This non-standard mechanism only gets checked if there is no
> >>>> standard range-op table entry for the tree code :-P
> >>>>
> >>> Hmm it looks like it'll work, but it keeps segfaulting in:
> >>>
> >>> bool
> >>> range_op_handler::fold_range (vrange &r, tree type,
> >>> 			      const vrange &lh,
> >>> 			      const vrange &rh,
> >>> 			      relation_trio rel) const
> >>> {
> >>>     gcc_checking_assert (m_valid);
> >>>     if (m_int)
> >>>       return m_int->fold_range (as_a <irange> (r), type,
> >>> 			   as_a <irange> (lh),
> >>> 			   as_a <irange> (rh), rel);
> >>>
> >>> while trying to call fold_range.
> >>>
> >>> But m_int is set to the right instance. Probably something I'm
> >>> missing, I'll double check it all.
> >>>
> >> Hmm.  whats your class operator_widen_mult* look like? what are you
> >> inheriting from?   Send me your patch and I'll have a look if you
> >> want. this is somewhat  new territory :-)
> > I've attached the patch, and my testcase is:
> >
> > int decMultiplyOp_zacc, decMultiplyOp_iacc; int *decMultiplyOp_lp;
> > void decMultiplyOp() {
> >    decMultiplyOp_lp = &decMultiplyOp_zacc;
> >    for (; decMultiplyOp_lp < &decMultiplyOp_zacc + decMultiplyOp_iacc;
> >         decMultiplyOp_lp++)
> >      *decMultiplyOp_lp = 0;
> > }
> >
> > And compiling with aarch64-none-elf-gcc -O2 zero.c -S -o -
> > -Werror=stringop-overflow
> >
> > Also to explain a bit on why we're only seeing this now:
> >
> > The original sequence for most of the pipeline is based on a cast and
> > multiplication
> >
> >    # RANGE [irange] long unsigned int [0,
> 2147483647][18446744071562067968, +INF]
> >    _14 = (long unsigned intD.11) decMultiplyOp_iacc.2_13;
> >    # RANGE [irange] long unsigned int [0,
> 8589934588][18446744065119617024, 18446744073709551612]
> NONZERO 0xfffffffffffffffc
> >    _15 = _14 * 4;
> >
> > But things like widening multiply are quite common, so some ISAs have it on
> scalars as well, not just vectors.
> > So there's a pass widening_mul that runs late for these targets.  This
> > replaces the above with
> >
> >    # RANGE [irange] long unsigned int [0,
> 8589934588][18446744065119617024, 18446744073709551612]
> NONZERO 0xfffffffffffffffc
> >    _15 = decMultiplyOp_iacc.2_13 w* 4;
> >
> > And copies over the final range from the original expression.
> >
> > After that there are passes like the warning passes that try to requery ranged
> to see if any optimization  has changed them.
> > Before my attempt to support *w this would just return VARYING and it
> would only use the old range.
> >
> > Now however, without taking care to sign extend when appropriate the
> > MIN range changes from a negative value to a large positive one when
> > we increase the precision.  So passes that re-query late get the wrong range.
> That's why for instance in this case we get an incorrect warning generated.
> >
> > Thanks for the help!
> >
> > Tamar
> >
> >> I cant imagine it being a linkage thing between the 2 files since the
> >> operator is defined in another file and the address taken in this one?
> >> that should work, but strange that cant make the call...
> >>
> >> Andrew
> 
> It is some sort of linkage/vtable thing.  The fix.diff patch applied on top of
> what you have will fix the fold issue. This'll do for now until I formalize how this
> is going to work goign forward.

Ah, I did see gdb warning about the vtable 😊

> 
> Inheriting from operator_mult is also going to be hazardous because it also
> has an op1_range and op2_range...  you should at least define those and
> return VARYING to avoid other issues.  Same thing applies to widen_plus I
> think, and it has relation processing and other things as well.  Your widen
> operands are not what those classes expect, so I think you probably just want
> a fresh range operator.
> 
> It also looks like the mult operation is sign/zero extending both upper bounds,
> and neither lower bound..   I think that should be the LH upper and lower
> bound?

Ah yes, that was a typo.

> 
> I've attached a second patch  (newversion.patch) which incorporates my fix,
> the fix to the sign of only op1's bounds,  as well as a simplification of the
> classes to not inherit from operator_mult/plus.. I think this still does what you
> want?  and it wont get you into unexpected trouble later :-)
> 
> let me know if this is still doing what you are expecting...

Yes it was! And it works perfectly.  I think I'll need the same for widen_plus, so I'll
make those changes, do a full regression run, and submit the finished patch.

Thanks for all the help!

Cheers,
Tamar
> 
> Andrew
Tamar Christina Feb. 27, 2023, 11:09 a.m. UTC | #33
Hi,

> > I avoided open coding it with add and shift because it creates a
> > 4-instruction dependency chain (with shifts, which are typically slow)
> > instead of a load and multiply.  This change, unless the target is
> > known to optimize it further is unlikely to be beneficial.  And by the
> > time we get to costing the only alternative is to undo the existing pattern and
> so you lose the general shift optimization.
> >
> > So it seemed unwise to open code as shifts, given the codegen out of
> > the vectorizer would be degenerate for most targets or one needs the
> > more complicated route of costing during pattern matching already.
> 
> Hmm, OK.  That seems like a cost-model thing though, rather than something
> that should be exposed through optabs.  And I imagine the open-coded
> version would still be better than nothing on targets without highpart multiply.
> 
> So how about replacing the hook with one that simply asks whether division
> through highpart multiplication is preferred over the add/shift sequence?
> (Unfortunately it's not going to be possible to work that out from existing
> information.)

So this doesn't work for SVE.  For SVE the multiplication widening pass introduces
FMAs at gimple level.  So in the cases where the operation is fed from a widening
multiplication we end up generating an FMA.  If that were all, I could have matched the FMA.

But it also pushes the multiplication into the second operand because it no longer has
a mul to share the results with.

In any case, the gimple code is transformed into

vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;

This transformation is much worse than the original code: it extends the dependency
chain with another expensive instruction.  I can try to correct this in RTL by matching
the FMA and shift and splitting them into MUL + ADDHNB, hoping CSE takes care of the extra mul.

But this seems like a hack, and it's basically undoing the earlier transformation.  It seems to
me that the open coding is a bad idea.

Do you still want it Richard?

Thanks,
Tamar
> 
> Thanks,
> Richard
> 
> >
> >>
> >> Some comments in addition to Richard's:
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Hi All,
> >> >
> >> > As discussed in the ticket, this replaces the approach for
> >> > optimizing the div by bitmask operation from a hook into optabs
> >> > implemented through add_highpart.
> >> >
> >> > In order to be able to use this we need to check whether the
> >> > current precision has enough bits to do the operation without any
> >> > of the additions
> >> overflowing.
> >> >
> >> > We use range information to determine this and only do the
> >> > operation if we're sure am overflow won't occur.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> >> issues.
> >> >
> >> > Ok for master?
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> >> Remove.
> >> > 	* doc/tm.texi.in: Likewise.
> >> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
> >> patch.
> >> > 	* expmed.cc (expand_divmod): Likewise.
> >> > 	* expmed.h (expand_divmod): Likewise.
> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> >> > 	* optabs.cc (expand_doubleword_mod,
> >> expand_doubleword_divmod): Likewise.
> >> > 	* internal-fn.def (ADDH): New.
> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> >> > 	* doc/md.texi: Document them.
> >> > 	* doc/rtl.texi: Likewise.
> >> > 	* target.def (can_special_div_by_const): Remove.
> >> > 	* target.h: Remove tree-core.h include
> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook
> >> and
> >> > 	implement new obtab recognition based on range.
> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> > 	PR target/108583
> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >> >
> >> > --- inline copy of patch --
> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >> > 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f7408038595e21af35d 100644
> >> > --- a/gcc/doc/md.texi
> >> > +++ b/gcc/doc/md.texi
> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> >> @code{smul_highpart} RTX expression.
> >> >  Similar, but the multiplication is unsigned.  This may be
> >> > represented in RTL using an @code{umul_highpart} RTX expression.
> >> >
> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{smul@var{m}3_highpart}
> >>
> >> sadd
> >>
> >> > +Perform a signed addition of operands 1 and 2, which have mode
> >> > +@var{m}, and store the most significant half of the product in operand
> 0.
> >> > +The least significant half of the product is discarded.  This may
> >> > +be represented in RTL using a @code{sadd_highpart} RTX expression.
> >> > +
> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
> >> > +This may be represented in RTL using an @code{uadd_highpart} RTX
> >> > +expression.
> >> > +
> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
> extend
> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
> >> > b/gcc/doc/rtl.texi index
> >> > d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> >> > --- a/gcc/doc/rtl.texi
> >> > +++ b/gcc/doc/rtl.texi
> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> >> > @code{smul_highpart} returns the high part  of a signed
> >> > multiplication, @code{umul_highpart} returns the high part  of an
> >> > unsigned
> >> multiplication.
> >> >
> >> > +@findex sadd_highpart
> >> > +@findex uadd_highpart
> >> > +@cindex high-part addition
> >> > +@cindex addition high part
> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
> >> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
> >> > +@code{sadd_highpart} returns the high part of a signed addition,
> >> > +@code{uadd_highpart} returns the high part of an unsigned addition.
> >>
> >> The patch doesn't add these RTL codes though.
> >>
> >> > +
> >> >  @findex fma
> >> >  @cindex fused multiply-add
> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >> > c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> >> > --- a/gcc/doc/tm.texi
> >> > +++ b/gcc/doc/tm.texi
> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for
> >> > the hook to handle these two  implementation approaches itself.
> >> >  @end deftypefn
> >> >
> >> > -@deftypefn {Target Hook} bool
> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> >> @var{tree_code}, tree
> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> >> > target has a special method of -division of vectors of type
> >> > @var{vectype}
> >> using the value @var{constant}, -and producing a vector of type
> >> @var{vectype}.  The division -will then not be decomposed by the
> >> vectorizer and kept as a div.
> >> > -
> >> > -When the hook is being used to test whether the target supports a
> >> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> > When the hook -is being used to emit a division, @var{in0} and
> >> > @var{in1} are the source -vectors of type @var{vecttype} and
> >> > @var{output} is the destination vector of -type @var{vectype}.
> >> > -
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it -if rtxes are provided and updating @var{output}.
> >> > -@end deftypefn
> >> > -
> >> >  @deftypefn {Target Hook} tree
> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> >> @var{code},
> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
> >> > return the decl of a function that implements the  vectorized
> >> > variant of the function with the @code{combined_fn} code diff --git
> >> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
> >> > 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> >> > --- a/gcc/doc/tm.texi.in
> >> > +++ b/gcc/doc/tm.texi.in
> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> >> strategy can generate better code.
> >> >
> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >> >
> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> >> > -
> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >> >
> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >> > 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
> >> > --- a/gcc/explow.cc
> >> > +++ b/gcc/explow.cc
> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >> >       TRUNC_DIV_EXPR.  */
> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > size, align_rtx,
> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
> >> >  			NULL_RTX, 1);
> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >> >
> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
> >> required_align)
> >> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
> >> >  				       Pmode),
> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> > target,
> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
> >> >  					Pmode),
> >> >  			  NULL_RTX, 1);
> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >> > 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
> >> > --- a/gcc/expmed.h
> >> > +++ b/gcc/expmed.h
> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
> >> (enum tree_code, machine_mode, rtx, int, rtx,
> >> >  			       int);
> >> >  #ifdef GCC_OPTABS_H
> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
> >> tree,
> >> > -			  rtx, rtx, rtx, int,
> >> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
> >> rtx,
> >> > +			  rtx, int, enum optab_methods =
> >> OPTAB_LIB_WIDEN);
> >> >  #endif
> >> >  #endif
> >> >
> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >> > 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
> >> > --- a/gcc/expmed.cc
> >> > +++ b/gcc/expmed.cc
> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode,
> rtx
> >> op0,
> >> > HOST_WIDE_INT d)
> >> >
> >> >  rtx
> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> >> mode,
> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> >> > -	       int unsignedp, enum optab_methods methods)
> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> >> > +	       enum optab_methods methods)
> >> >  {
> >> >    machine_mode compute_mode;
> >> >    rtx tquotient;
> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> > code, machine_mode mode,
> >> >
> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
> >> > 0;
> >> >
> >> > -  /* Check if the target has specific expansions for the division.
> >> > */
> >> > -  tree cst;
> >> > -  if (treeop0
> >> > -      && treeop1
> >> > -      && (cst = uniform_integer_cst_p (treeop1))
> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> >> (treeop0),
> >> > -						     wi::to_wide (cst),
> >> > -						     &target, op0, op1))
> >> > -    return target;
> >> > -
> >> > -
> >> >    /* Now convert to the best mode to use.  */
> >> >    if (compute_mode != mode)
> >> >      {
> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  			    || (optab_handler (sdivmod_optab, int_mode)
> >> >  				!= CODE_FOR_nothing)))
> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> >> > -						int_mode, treeop0, treeop1,
> >> > -						op0, gen_int_mode (abs_d,
> >> > +						int_mode, op0,
> >> > +						gen_int_mode (abs_d,
> >> >  							      int_mode),
> >> >  						NULL_RTX, 0);
> >> >  		    else
> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
> tree_code
> >> code, machine_mode mode,
> >> >  				      size - 1, NULL_RTX, 0);
> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
> >> >  				    NULL_RTX);
> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> >> treeop0,
> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
> >> op1,
> >> > +				    NULL_RTX, 0);
> >> >  		if (t4)
> >> >  		  {
> >> >  		    rtx t5;
> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >> > 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e756216339078d5b2280c6e277f26d72 100644
> >> > --- a/gcc/expr.cc
> >> > +++ b/gcc/expr.cc
> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >> >  	    return expand_divmod (0,
> >> >  				  FLOAT_MODE_P (GET_MODE (value))
> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> >> > -				  GET_MODE (value), NULL, NULL, op1, op2,
> >> > -				  target, 0);
> >> > +				  GET_MODE (value), op1, op2, target, 0);
> >> >  	case MOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 0);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 0);
> >> >  	case UDIV:
> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case UMOD:
> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> NULL, NULL,
> >> > -				op1, op2, target, 1);
> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
> >> op1, op2,
> >> > +				target, 1);
> >> >  	case ASHIFTRT:
> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
> >> 9170,13 +9169,11 @@
> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> >> treeop0,
> >> >        bool speed_p = optimize_insn_for_speed_p ();
> >> >        do_pending_stack_adjust ();
> >> >        start_sequence ();
> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -				   op0, op1, target, 1);
> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 1);
> >> >        rtx_insn *uns_insns = get_insns ();
> >> >        end_sequence ();
> >> >        start_sequence ();
> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -				   op0, op1, target, 0);
> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> > + target, 0);
> >> >        rtx_insn *sgn_insns = get_insns ();
> >> >        end_sequence ();
> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
> >> > -9198,8
> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> >> mode, tree treeop0,
> >> >        emit_insn (sgn_insns);
> >> >        return sgn_ret;
> >> >      }
> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> > -			op0, op1, target, unsignedp);
> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> >> > + unsignedp);
> >> >  }
> >> >
> >> >  rtx
> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >> >
> >>
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052
> 584f5a
> >> 3b
> >> > 8a734baa800f 100644
> >> > --- a/gcc/internal-fn.def
> >> > +++ b/gcc/internal-fn.def
> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
> >> ECF_CONST
> >> > | ECF_NOTHROW, first,
> >> >
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smul_highpart, umul_highpart, binary)
> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > +			      sadd_highpart, uadd_highpart, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> >  			      smulhs, umulhs, binary)
> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
> >> ECF_NOTHROW, first,
> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >
> >>
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe03
> 69a6
> >> e
> >> > 77082c1e617b 100644
> >> > --- a/gcc/optabs.cc
> >> > +++ b/gcc/optabs.cc
> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> >> mode, rtx op0, rtx op1, bool unsignedp)
> >> >  		return NULL_RTX;
> >> >  	    }
> >> >  	}
> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
> >> NULL, NULL,
> >> > -				     sum, gen_int_mode (INTVAL (op1),
> >> > -							word_mode),
> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> sum,
> >> > +				     gen_int_mode (INTVAL (op1),
> >> word_mode),
> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >> >        if (remainder == NULL_RTX)
> >> >  	return NULL_RTX;
> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
> >> mode, rtx
> >> > op0, rtx op1, rtx *rem,
> >> >
> >> >    if (op11 != const1_rtx)
> >> >      {
> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> >> op11,
> >> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
> >> mode, rtx op0, rtx op1, rtx *rem,
> >> >        if (rem2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
> NULL,
> >> quot1,
> >> > -				 op11, NULL_RTX, unsignedp,
> >> OPTAB_DIRECT);
> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1,
> op11,
> >> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
> >> >        if (quot2 == NULL_RTX)
> >> >  	return NULL_RTX;
> >> >
> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >> >
> >>
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2
> d7a5
> >> ccb
> >> > f6147947351a 100644
> >> > --- a/gcc/optabs.def
> >> > +++ b/gcc/optabs.def
> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >> >
> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> >> > (umul_highpart_optab, "umul$a3_highpart")
> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >> >
> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
> >> >
> >>
> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed0
> 8d1d
> >> 81a
> >> > fa2c2baa64a5 100644
> >> > --- a/gcc/target.def
> >> > +++ b/gcc/target.def
> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >> >  	const vec_perm_indices &sel),
> >> >   NULL)
> >> >
> >> > -DEFHOOK
> >> > -(can_special_div_by_const,
> >> > - "This hook is used to test whether the target has a special
> >> > method of\n\ -division of vectors of type @var{vectype} using the
> >> > value @var{constant},\n\ -and producing a vector of type
> >> > @var{vectype}.  The division\n\ -will then not be decomposed by the
> >> > vectorizer and kept as a div.\n\ -\n\ -When the hook is being used
> >> > to test whether the target supports a special\n\ -divide,
> >> > @var{in0}, @var{in1}, and @var{output} are all null.  When the
> >> > hook\n\ -is being used to emit a division, @var{in0} and @var{in1}
> >> > are the source\n\ -vectors of type @var{vecttype} and @var{output}
> >> > is the destination vector of\n\ -type @var{vectype}.\n\ -\n\
> >> > -Return true if the operation is possible, emitting instructions
> >> > for it\n\ -if rtxes are provided and updating @var{output}.",
> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> >> > -	rtx in0, rtx in1),
> >> > - default_can_special_div_by_const)
> >> > -
> >> >  /* Return true if the target supports misaligned store/load of a
> >> >     specific factor denoted in the third parameter.  The last parameter
> >> >     is true if the access is defined in a packed struct.  */ diff
> >> > --git a/gcc/target.h b/gcc/target.h index
> >> >
> >>
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fc
> a82b9
> >> 9f9
> >> > 13158c2d47b1 100644
> >> > --- a/gcc/target.h
> >> > +++ b/gcc/target.h
> >> > @@ -51,7 +51,6 @@
> >> >  #include "insn-codes.h"
> >> >  #include "tm.h"
> >> >  #include "hard-reg-set.h"
> >> > -#include "tree-core.h"
> >> >
> >> >  #if CHECKING_P
> >> >
> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >> >
> >>
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2
> 24454
> >> 93
> >> > 17a31390f0c2 100644
> >> > --- a/gcc/targhooks.h
> >> > +++ b/gcc/targhooks.h
> >> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
> >> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
> >> > (rtx, tree, tree);  extern unsigned int
> >> > default_case_values_threshold (void);  extern bool
> >> > default_have_conditional_execution (void); -extern bool
> >> > default_can_special_div_by_const (enum tree_code, tree,
> >> wide_int,
> >> > -					      rtx *, rtx, rtx);
> >> >
> >> >  extern bool default_libc_has_function (enum function_class, tree);
> >> > extern bool default_libc_has_fast_function (int fcode); diff --git
> >> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >> >
> >>
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2
> da91e
> >> 03
> >> > 877337a931e7 100644
> >> > --- a/gcc/targhooks.cc
> >> > +++ b/gcc/targhooks.cc
> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
> >> >    return HAVE_conditional_execution;  }
> >> >
> >> > -/* Default that no division by constant operations are special.
> >> > */ -bool -default_can_special_div_by_const (enum tree_code, tree,
> >> > wide_int, rtx *, rtx,
> >> > -				  rtx)
> >> > -{
> >> > -  return false;
> >> > -}
> >> > -
> >> >  /* By default we assume that c99 functions are present at the runtime,
> >> >     but sincos is not.  */
> >> >  bool
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..c81f8946922250234b
> f759e0a0
> >> a0
> >> > 4ea8c1f73e3c
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> > @@ -0,0 +1,25 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> >> > +
> >> > +static __attribute__((__noinline__)) __attribute__((__noclone__))
> >> > +V foo (V v, unsigned short i) {
> >> > +  v /= i;
> >> > +  return v;
> >> > +}
> >> > +
> >> > +int
> >> > +main (void)
> >> > +{
> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
> >> > +}, 0xffff);
> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> >> > +    if (v[i] != 0x00010001)
> >> > +      __builtin_abort ();
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> >> > +detected" "vect" { target aarch64*-*-* } } } */
> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > new file mode 100644
> >> > index
> >> >
> >>
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e63
> 06b4991
> >> 4d2
> >> > a29b933de625
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> > @@ -0,0 +1,58 @@
> >> > +/* { dg-require-effective-target vect_int } */
> >> > +
> >> > +#include <stdint.h>
> >> > +#include <stdio.h>
> >> > +#include "tree-vect.h"
> >> > +
> >> > +#define N 50
> >> > +#define TYPE uint8_t
> >> > +
> >> > +#ifndef DEBUG
> >> > +#define DEBUG 0
> >> > +#endif
> >> > +
> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> >> > +
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> >> > +restrict pixel, TYPE level, int n) {
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> > +
> >> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> >> > +restrict pixel, TYPE level, int n) {
> >> > +  for (int i = 0; i < n; i+=1)
> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> > +
> >> > +int main ()
> >> > +{
> >> > +  TYPE a[N];
> >> > +  TYPE b[N];
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      a[i] = BASE + i * 13;
> >> > +      b[i] = BASE + i * 13;
> >> > +      if (DEBUG)
> >> > +        printf ("%d: 0x%x\n", i, a[i]);
> >> > +    }
> >> > +
> >> > +  fun1 (a, N / 2, N);
> >> > +  fun2 (b, N / 2, N);
> >> > +
> >> > +  for (int i = 0; i < N; ++i)
> >> > +    {
> >> > +      if (DEBUG)
> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> >> > +
> >> > +      if (a[i] != b[i])
> >> > +        __builtin_abort ();
> >> > +    }
> >> > +  return 0;
> >> > +}
> >> > +
> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
> >> > +{ target aarch64*-*-* } } } */
> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> >> > index
> >> >
> >>
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c
> 14077d
> >> c3
> >> > e970bed75ef6 100644
> >> > --- a/gcc/tree-vect-generic.cc
> >> > +++ b/gcc/tree-vect-generic.cc
> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >> >  	  tree ret;
> >> >
> >> > -	  /* Check if the target was going to handle it through the special
> >> > -	     division callback hook.  */
> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
> >> > -	  if (cst &&
> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL,
> >> > -							  NULL_RTX,
> >> NULL_RTX))
> >> > -	    return NULL_TREE;
> >> > -
> >> > -
> >> >  	  if (!optimize
> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc46
> 7f33
> >> 69
> >> > de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> vectype,
> >> > -							  wi::to_wide (cst),
> >> > -							  NULL, NULL_RTX,
> >> > -							  NULL_RTX))
> >> > +	   && TYPE_UNSIGNED (itype)
> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> > +	   && vectype
> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* Div optimizations using narrowings:
> >> > +       we can do the division of e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8, assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +	    short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> > +		     │                │
> >> > +		     │                │
> >> > +		    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done in
> >> > +       double the precision of the input.  However, if we know that the
> >> > +       addition `x + 257` does not overflow then we can do the operation
> >> > +       in the current precision, in which case we don't need the packs
> >> > +       and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +	{
> >> > +	  wide_int min,max;
> >> > +	  /* If we're in a pattern we need to find the original definition.  */
> >> > +	  tree op0 = oprnd0;
> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> > +	    {
> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +	    }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> > +	     information we can't perform the optimization.  */
> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> > +	    {
> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +	      wi::overflow_type ovf;
> >> > +	      /* We need adder and max in the same precision.  */
> >> > +	      wide_int zadder
> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
> >> > +					  UNSIGNED);
> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >
> > And have the addition be done as a 32-bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So in the above the range will correctly be 0x1fe, but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype, which is derived from the size in which
> > the vectorizer will perform the operation.
> >
> > Thanks,
> > Tamar
> >
> >>
> >> Thanks,
> >> Richard
> >>
> >> > +	      if (ovf == wi::OVF_NONE)
> >> > +		{
> >> > +		  *type_out = vectype;
> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
> >> > +		  gcall *patt1
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> >> tadder);
> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (patt1, lhs);
> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> >> vectype);
> >> > +
> >> > +		  pattern_stmt
> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> >> > +
> >> > +		  return pattern_stmt;
> >> > +		}
> >> > +	    }
> >> > +	}
> >> >      }
> >> >
> >> >    if (prec > HOST_BITS_PER_WIDE_INT diff --git
> >> > a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
> >> >
> >>
> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0
> b95
> >> 64f
> >> > c4e066e50081 100644
> >> > --- a/gcc/tree-vect-stmts.cc
> >> > +++ b/gcc/tree-vect-stmts.cc
> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >> >  	}
> >> >        target_support_p = (optab_handler (optab, vec_mode)
> >> >  			  != CODE_FOR_nothing);
> >> > -      tree cst;
> >> > -      if (!target_support_p
> >> > -	  && op1
> >> > -	  && (cst = uniform_integer_cst_p (op1)))
> >> > -	target_support_p
> >> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
> >> > -							wi::to_wide (cst),
> >> > -							NULL, NULL_RTX,
> >> > -							NULL_RTX);
> >> >      }
> >> >
> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
> >> > (vectype);
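
For reference, the transform described in the comment above can be modelled
in scalar C.  This is a minimal sketch (not part of the patch; the helper
name is made up) of the (x + ((x + 257) >> 8)) >> 8 identity for dividing a
16-bit value by 0xff, where each add-then-shift pair corresponds to one
highpart addition done at the original precision:

#include <assert.h>
#include <stdint.h>

/* x / 0xff for 16-bit x, using only same-precision adds and shifts.
   Valid only while x + 257 does not wrap 16 bits, which is what the
   range check in the pattern is there to guarantee.  */
static uint16_t
div_by_0xff (uint16_t x)
{
  uint16_t t = (uint16_t) (x + 257) >> 8;   /* high byte of x + 0x101 */
  return (uint16_t) (x + t) >> 8;           /* high byte of x + t */
}

int
main (void)
{
  /* 0xfefe is the largest x for which x + 257 still fits in 16 bits.  */
  for (uint32_t x = 0; x <= 0xfefe; x++)
    assert (div_by_0xff ((uint16_t) x) == x / 0xff);
  return 0;
}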
Richard Sandiford Feb. 27, 2023, 12:11 p.m. UTC | #34
Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi,
>
>> > I avoided open coding it with add and shift because it creates a
>> > four-instruction dependency chain (with shifts, which are typically
>> > slow) instead of a load and multiply.  This change, unless the target
>> > is known to optimize it further, is unlikely to be beneficial.  And by
>> > the time we get to costing, the only alternative is to undo the
>> > existing pattern, and so you lose the general shift optimization.
>> >
>> > So it seemed unwise to open code it as shifts, given that the codegen
>> > out of the vectorizer would be degenerate for most targets, or one
>> > would need the more complicated route of costing during pattern
>> > matching already.
>> 
>> Hmm, OK.  That seems like a cost-model thing though, rather than something
>> that should be exposed through optabs.  And I imagine the open-coded
>> version would still be better than nothing on targets without highpart multiply.
>> 
>> So how about replacing the hook with one that simply asks whether division
>> through highpart multiplication is preferred over the add/shift sequence?
>> (Unfortunately it's not going to be possible to work that out from existing
>> information.)
>
> So this doesn't work for SVE.  For SVE the multiplication widening pass
> introduces FMAs at gimple level.  So in the cases where the operation is
> fed from a widening multiplication we end up generating an FMA.  If that
> were all, I could have matched the FMA.
>
> But it also pushes the multiplication into the second operand because it
> no longer has a mul to share the results with.
>
> In any case, the gimple code is transformed into
>
> vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257, ... });
> vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124, vect_patt_65.12_128);
> vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21) vect_patt_62.14_130;
>
> This transformation is much worse than the original code: it extends the
> dependency chain with another expensive instruction.  I can try to correct
> this in RTL by matching the FMA and shift and splitting them into MUL +
> ADDHNB, hoping CSE takes care of the extra mul.
>
> But this seems like a hack, and it's basically undoing the earlier transformation.  It seems to
> me that the open coding is a bad idea.

Could you post the patch that gives this result?  I'll have a poke around.

Thanks,
Richard
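
As a concrete illustration of the zadder discussion above, the overflow
pre-check can be modelled as follows.  This is a sketch with illustrative
values, not the patch's wide_int code: max stands for the recorded upper
bound of the promoted expression, held at 32-bit precision, while the
0x101 adder is formed at the 16-bit itype precision and therefore has to
be zero-extended before the test.

#include <stdbool.h>
#include <stdint.h>

/* Does max + adder stay representable at prec bits?  Models the
   wi::add (max, zadder, UNSIGNED, &ovf) test.  */
static bool
addition_fits (uint32_t max, uint16_t adder, unsigned prec)
{
  uint64_t limit = ((uint64_t) 1 << prec) - 1;  /* largest prec-bit value */
  return (uint64_t) max + adder <= limit;       /* no wrap at prec bits */
}

/* For the uint8_t example: addition_fits (0x1fe, 0x101, 16) is true,
   since 0x1fe + 0x101 = 0x2ff, so x + 0x101 cannot wrap a 16-bit
   element and the single-precision ADDH sequence is safe.  */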

> Do you still want it, Richard?
>
> Thanks,
> Tamar
>> 
>> Thanks,
>> Richard
>> 
>> >
>> >>
>> >> Some comments in addition to Richard's:
>> >>
>> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > Hi All,
>> >> >
>> >> > As discussed in the ticket, this replaces the approach for
>> >> > optimizing the div by bitmask operation from a hook into optabs
>> >> > implemented through add_highpart.
>> >> >
>> >> > In order to be able to use this we need to check whether the
>> >> > current precision has enough bits to do the operation without any
>> >> > of the additions overflowing.
>> >> >
>> >> > We use range information to determine this and only do the
>> >> > operation if we're sure an overflow won't occur.
>> >> >
>> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
>> >> issues.
>> >> >
>> >> > Ok for master?
>> >> >
>> >> > Thanks,
>> >> > Tamar
>> >> >
>> >> > gcc/ChangeLog:
>> >> >
>> >> > 	PR target/108583
>> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
>> >> Remove.
>> >> > 	* doc/tm.texi.in: Likewise.
>> >> > 	* explow.cc (round_push, align_dynamic_address): Revert previous
>> >> patch.
>> >> > 	* expmed.cc (expand_divmod): Likewise.
>> >> > 	* expmed.h (expand_divmod): Likewise.
>> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
>> >> > 	* optabs.cc (expand_doubleword_mod,
>> >> expand_doubleword_divmod): Likewise.
>> >> > 	* internal-fn.def (ADDH): New.
>> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
>> >> > 	* doc/md.texi: Document them.
>> >> > 	* doc/rtl.texi: Likewise.
>> >> > 	* target.def (can_special_div_by_const): Remove.
>> >> > 	* target.h: Remove tree-core.h include
>> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
>> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
>> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
>> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook and
>> >> > 	implement new optab recognition based on range.
>> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
>> >> >
>> >> > gcc/testsuite/ChangeLog:
>> >> >
>> >> > 	PR target/108583
>> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
>> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
>> >> >
>> >> > --- inline copy of patch --
>> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
>> >> >
>> >>
>> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f
>> 74080
>> >> 3
>> >> > 8595e21af35d 100644
>> >> > --- a/gcc/doc/md.texi
>> >> > +++ b/gcc/doc/md.texi
>> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
>> >> @code{smul_highpart} RTX expression.
>> >> >  Similar, but the multiplication is unsigned.  This may be
>> >> > represented in RTL using an @code{umul_highpart} RTX expression.
>> >> >
>> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
>> >> > +@samp{smul@var{m}3_highpart}
>> >>
>> >> sadd
>> >>
>> >> > +Perform a signed addition of operands 1 and 2, which have mode
>> >> > +@var{m}, and store the most significant half of the product in operand
>> 0.
>> >> > +The least significant half of the product is discarded.  This may
>> >> > +be represented in RTL using a @code{sadd_highpart} RTX expression.
>> >> > +
>> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
>> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is unsigned.
>> >> > +This may be represented in RTL using an @code{uadd_highpart} RTX
>> >> > +expression.
>> >> > +
>> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern  @item
>> >> > @samp{madd@var{m}@var{n}4}  Multiply operands 1 and 2, sign-
>> extend
>> >> > them to mode @var{n}, add diff --git a/gcc/doc/rtl.texi
>> >> > b/gcc/doc/rtl.texi index
>> >> >
>> >>
>> d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00
>> 343
>> >> d17
>> >> > 1940ec4222f3 100644
>> >> > --- a/gcc/doc/rtl.texi
>> >> > +++ b/gcc/doc/rtl.texi
>> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
>> >> > @code{smul_highpart} returns the high part  of a signed
>> >> > multiplication, @code{umul_highpart} returns the high part  of an
>> >> > unsigned
>> >> multiplication.
>> >> >
>> >> > +@findex sadd_highpart
>> >> > +@findex uadd_highpart
>> >> > +@cindex high-part addition
>> >> > +@cindex addition high part
>> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
>> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the high-part
>> >> > +addition of @var{x} and @var{y} carried out in machine mode @var{m}.
>> >> > +@code{sadd_highpart} returns the high part of a signed addition,
>> >> > +@code{uadd_highpart} returns the high part of an unsigned addition.
>> >>
>> >> The patch doesn't add these RTL codes though.
>> >>
>> >> > +
>> >> >  @findex fma
>> >> >  @cindex fused multiply-add
>> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z}) diff --git
>> >> > a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
>> >> >
>> >>
>> c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d57
>> 914840
>> >> 17e
>> >> > 6b0d62ab077e 100644
>> >> > --- a/gcc/doc/tm.texi
>> >> > +++ b/gcc/doc/tm.texi
>> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need for
>> >> > the hook to handle these two  implementation approaches itself.
>> >> >  @end deftypefn
>> >> >
>> >> > -@deftypefn {Target Hook} bool
>> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
>> >> @var{tree_code}, tree
>> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
>> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
>> >> > target has a special method of -division of vectors of type
>> >> > @var{vectype}
>> >> using the value @var{constant}, -and producing a vector of type
>> >> @var{vectype}.  The division -will then not be decomposed by the
>> >> vectorizer and kept as a div.
>> >> > -
>> >> > -When the hook is being used to test whether the target supports a
>> >> > special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
>> >> > When the hook -is being used to emit a division, @var{in0} and
>> >> > @var{in1} are the source -vectors of type @var{vecttype} and
>> >> > @var{output} is the destination vector of -type @var{vectype}.
>> >> > -
>> >> > -Return true if the operation is possible, emitting instructions
>> >> > for it -if rtxes are provided and updating @var{output}.
>> >> > -@end deftypefn
>> >> > -
>> >> >  @deftypefn {Target Hook} tree
>> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
>> >> @var{code},
>> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook should
>> >> > return the decl of a function that implements the  vectorized
>> >> > variant of the function with the @code{combined_fn} code diff --git
>> >> > a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index
>> >> >
>> >>
>> 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d
>> 1efec0
>> >> a3a
>> >> > bccd1c293c7b 100644
>> >> > --- a/gcc/doc/tm.texi.in
>> >> > +++ b/gcc/doc/tm.texi.in
>> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
>> >> strategy can generate better code.
>> >> >
>> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
>> >> >
>> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
>> >> > -
>> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
>> >> >
>> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
>> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
>> >> >
>> >>
>> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc
>> 212f0
>> >> bef
>> >> > a016eea4573c 100644
>> >> > --- a/gcc/explow.cc
>> >> > +++ b/gcc/explow.cc
>> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
>> >> >       TRUNC_DIV_EXPR.  */
>> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
>> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
>> >> > size, align_rtx,
>> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
>> >> >  			NULL_RTX, 1);
>> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
>> >> >
>> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target, unsigned
>> >> required_align)
>> >> >  			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
>> >> >  				       Pmode),
>> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
>> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
>> >> > target,
>> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
>> >> >  			  gen_int_mode (required_align / BITS_PER_UNIT,
>> >> >  					Pmode),
>> >> >  			  NULL_RTX, 1);
>> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
>> >> >
>> >>
>> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c
>> 5364
>> >> 094
>> >> > 1628068f3901 100644
>> >> > --- a/gcc/expmed.h
>> >> > +++ b/gcc/expmed.h
>> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
>> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx maybe_expand_shift
>> >> (enum tree_code, machine_mode, rtx, int, rtx,
>> >> >  			       int);
>> >> >  #ifdef GCC_OPTABS_H
>> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode, tree,
>> >> tree,
>> >> > -			  rtx, rtx, rtx, int,
>> >> > -			  enum optab_methods = OPTAB_LIB_WIDEN);
>> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx,
>> >> rtx,
>> >> > +			  rtx, int, enum optab_methods =
>> >> OPTAB_LIB_WIDEN);
>> >> >  #endif
>> >> >  #endif
>> >> >
>> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
>> >> >
>> >>
>> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025a
>> b18a3
>> >> a59
>> >> > c169d3b7692f 100644
>> >> > --- a/gcc/expmed.cc
>> >> > +++ b/gcc/expmed.cc
>> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode mode,
>> rtx
>> >> op0,
>> >> > HOST_WIDE_INT d)
>> >> >
>> >> >  rtx
>> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
>> >> mode,
>> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
>> >> > -	       int unsignedp, enum optab_methods methods)
>> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
>> >> > +	       enum optab_methods methods)
>> >> >  {
>> >> >    machine_mode compute_mode;
>> >> >    rtx tquotient;
>> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
>> tree_code
>> >> > code, machine_mode mode,
>> >> >
>> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
>> >> > 0;
>> >> >
>> >> > -  /* Check if the target has specific expansions for the division.
>> >> > */
>> >> > -  tree cst;
>> >> > -  if (treeop0
>> >> > -      && treeop1
>> >> > -      && (cst = uniform_integer_cst_p (treeop1))
>> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
>> >> (treeop0),
>> >> > -						     wi::to_wide (cst),
>> >> > -						     &target, op0, op1))
>> >> > -    return target;
>> >> > -
>> >> > -
>> >> >    /* Now convert to the best mode to use.  */
>> >> >    if (compute_mode != mode)
>> >> >      {
>> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
>> tree_code
>> >> code, machine_mode mode,
>> >> >  			    || (optab_handler (sdivmod_optab, int_mode)
>> >> >  				!= CODE_FOR_nothing)))
>> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
>> >> > -						int_mode, treeop0, treeop1,
>> >> > -						op0, gen_int_mode (abs_d,
>> >> > +						int_mode, op0,
>> >> > +						gen_int_mode (abs_d,
>> >> >  							      int_mode),
>> >> >  						NULL_RTX, 0);
>> >> >  		    else
>> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
>> tree_code
>> >> code, machine_mode mode,
>> >> >  				      size - 1, NULL_RTX, 0);
>> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
>> >> >  				    NULL_RTX);
>> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
>> >> treeop0,
>> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
>> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3,
>> >> op1,
>> >> > +				    NULL_RTX, 0);
>> >> >  		if (t4)
>> >> >  		  {
>> >> >  		    rtx t5;
>> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
>> >> >
>> >>
>> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e75521633907
>> 8d5b
>> >> 2280
>> >> > c6e277f26d72 100644
>> >> > --- a/gcc/expr.cc
>> >> > +++ b/gcc/expr.cc
>> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
>> >> >  	    return expand_divmod (0,
>> >> >  				  FLOAT_MODE_P (GET_MODE (value))
>> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
>> >> > -				  GET_MODE (value), NULL, NULL, op1, op2,
>> >> > -				  target, 0);
>> >> > +				  GET_MODE (value), op1, op2, target, 0);
>> >> >  	case MOD:
>> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> NULL, NULL,
>> >> > -				op1, op2, target, 0);
>> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> op1, op2,
>> >> > +				target, 0);
>> >> >  	case UDIV:
>> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> >> NULL, NULL,
>> >> > -				op1, op2, target, 1);
>> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value),
>> >> op1, op2,
>> >> > +				target, 1);
>> >> >  	case UMOD:
>> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> NULL, NULL,
>> >> > -				op1, op2, target, 1);
>> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value),
>> >> op1, op2,
>> >> > +				target, 1);
>> >> >  	case ASHIFTRT:
>> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
>> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@ -
>> >> 9170,13 +9169,11 @@
>> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
>> >> treeop0,
>> >> >        bool speed_p = optimize_insn_for_speed_p ();
>> >> >        do_pending_stack_adjust ();
>> >> >        start_sequence ();
>> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> >> > -				   op0, op1, target, 1);
>> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> >> > + target, 1);
>> >> >        rtx_insn *uns_insns = get_insns ();
>> >> >        end_sequence ();
>> >> >        start_sequence ();
>> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> >> > -				   op0, op1, target, 0);
>> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
>> >> > + target, 0);
>> >> >        rtx_insn *sgn_insns = get_insns ();
>> >> >        end_sequence ();
>> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
>> >> > -9198,8
>> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
>> >> mode, tree treeop0,
>> >> >        emit_insn (sgn_insns);
>> >> >        return sgn_ret;
>> >> >      }
>> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> >> > -			op0, op1, target, unsignedp);
>> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
>> >> > + unsignedp);
>> >> >  }
>> >> >
>> >> >  rtx
>> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
>> >> >
>> >>
>> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052
>> 584f5a
>> >> 3b
>> >> > 8a734baa800f 100644
>> >> > --- a/gcc/internal-fn.def
>> >> > +++ b/gcc/internal-fn.def
>> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL,
>> >> ECF_CONST
>> >> > | ECF_NOTHROW, first,
>> >> >
>> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> >  			      smul_highpart, umul_highpart, binary)
>> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> > +			      sadd_highpart, uadd_highpart, binary)
>> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> >  			      smulhs, umulhs, binary)
>> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST |
>> >> ECF_NOTHROW, first,
>> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
>> >> >
>> >>
>> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe03
>> 69a6
>> >> e
>> >> > 77082c1e617b 100644
>> >> > --- a/gcc/optabs.cc
>> >> > +++ b/gcc/optabs.cc
>> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
>> >> mode, rtx op0, rtx op1, bool unsignedp)
>> >> >  		return NULL_RTX;
>> >> >  	    }
>> >> >  	}
>> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode,
>> >> NULL, NULL,
>> >> > -				     sum, gen_int_mode (INTVAL (op1),
>> >> > -							word_mode),
>> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
>> word_mode,
>> >> sum,
>> >> > +				     gen_int_mode (INTVAL (op1),
>> >> word_mode),
>> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
>> >> >        if (remainder == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod (machine_mode
>> >> mode, rtx
>> >> > op0, rtx op1, rtx *rem,
>> >> >
>> >> >    if (op11 != const1_rtx)
>> >> >      {
>> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
>> NULL,
>> >> quot1,
>> >> > -				op11, NULL_RTX, unsignedp,
>> >> OPTAB_DIRECT);
>> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
>> >> op11,
>> >> > +				NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >> >        if (rem2 == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> >
>> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod (machine_mode
>> >> mode, rtx op0, rtx op1, rtx *rem,
>> >> >        if (rem2 == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> >
>> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
>> NULL,
>> >> quot1,
>> >> > -				 op11, NULL_RTX, unsignedp,
>> >> OPTAB_DIRECT);
>> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1,
>> op11,
>> >> > +				 NULL_RTX, unsignedp, OPTAB_DIRECT);
>> >> >        if (quot2 == NULL_RTX)
>> >> >  	return NULL_RTX;
>> >> >
>> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
>> >> >
>> >>
>> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2
>> d7a5
>> >> ccb
>> >> > f6147947351a 100644
>> >> > --- a/gcc/optabs.def
>> >> > +++ b/gcc/optabs.def
>> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
>> >> >
>> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
>> >> > (umul_highpart_optab, "umul$a3_highpart")
>> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
>> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
>> >> >
>> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")  OPTAB_D (cmpstr_optab,
>> >> > "cmpstr$a") diff --git a/gcc/target.def b/gcc/target.def index
>> >> >
>> >>
>> db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed0
>> 8d1d
>> >> 81a
>> >> > fa2c2baa64a5 100644
>> >> > --- a/gcc/target.def
>> >> > +++ b/gcc/target.def
>> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
>> >> >  	const vec_perm_indices &sel),
>> >> >   NULL)
>> >> >
>> >> > -DEFHOOK
>> >> > -(can_special_div_by_const,
>> >> > - "This hook is used to test whether the target has a special
>> >> > method of\n\ -division of vectors of type @var{vectype} using the
>> >> > value @var{constant},\n\ -and producing a vector of type
>> >> > @var{vectype}.  The division\n\ -will then not be decomposed by the
>> >> > vectorizer and kept as a div.\n\ -\n\ -When the hook is being used
>> >> > to test whether the target supports a special\n\ -divide,
>> >> > @var{in0}, @var{in1}, and @var{output} are all null.  When the
>> >> > hook\n\ -is being used to emit a division, @var{in0} and @var{in1}
>> >> > are the source\n\ -vectors of type @var{vecttype} and @var{output}
>> >> > is the destination vector of\n\ -type @var{vectype}.\n\ -\n\
>> >> > -Return true if the operation is possible, emitting instructions
>> >> > for it\n\ -if rtxes are provided and updating @var{output}.",
>> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
>> >> > -	rtx in0, rtx in1),
>> >> > - default_can_special_div_by_const)
>> >> > -
>> >> >  /* Return true if the target supports misaligned store/load of a
>> >> >     specific factor denoted in the third parameter.  The last parameter
>> >> >     is true if the access is defined in a packed struct.  */ diff
>> >> > --git a/gcc/target.h b/gcc/target.h index
>> >> >
>> >>
>> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fc
>> a82b9
>> >> 9f9
>> >> > 13158c2d47b1 100644
>> >> > --- a/gcc/target.h
>> >> > +++ b/gcc/target.h
>> >> > @@ -51,7 +51,6 @@
>> >> >  #include "insn-codes.h"
>> >> >  #include "tm.h"
>> >> >  #include "hard-reg-set.h"
>> >> > -#include "tree-core.h"
>> >> >
>> >> >  #if CHECKING_P
>> >> >
>> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
>> >> >
>> >>
>> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2
>> 24454
>> >> 93
>> >> > 17a31390f0c2 100644
>> >> > --- a/gcc/targhooks.h
>> >> > +++ b/gcc/targhooks.h
>> >> > @@ -209,8 +209,6 @@ extern void default_addr_space_diagnose_usage
>> >> > (addr_space_t, location_t);  extern rtx default_addr_space_convert
>> >> > (rtx, tree, tree);  extern unsigned int
>> >> > default_case_values_threshold (void);  extern bool
>> >> > default_have_conditional_execution (void); -extern bool
>> >> > default_can_special_div_by_const (enum tree_code, tree,
>> >> wide_int,
>> >> > -					      rtx *, rtx, rtx);
>> >> >
>> >> >  extern bool default_libc_has_function (enum function_class, tree);
>> >> > extern bool default_libc_has_fast_function (int fcode); diff --git
>> >> > a/gcc/targhooks.cc b/gcc/targhooks.cc index
>> >> >
>> >>
>> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2
>> da91e
>> >> 03
>> >> > 877337a931e7 100644
>> >> > --- a/gcc/targhooks.cc
>> >> > +++ b/gcc/targhooks.cc
>> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution (void)
>> >> >    return HAVE_conditional_execution;  }
>> >> >
>> >> > -/* Default that no division by constant operations are special.
>> >> > */ -bool -default_can_special_div_by_const (enum tree_code, tree,
>> >> > wide_int, rtx *, rtx,
>> >> > -				  rtx)
>> >> > -{
>> >> > -  return false;
>> >> > -}
>> >> > -
>> >> >  /* By default we assume that c99 functions are present at the runtime,
>> >> >     but sincos is not.  */
>> >> >  bool
>> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> >> > new file mode 100644
>> >> > index
>> >> >
>> >>
>> 0000000000000000000000000000000000000000..c81f8946922250234b
>> f759e0a0
>> >> a0
>> >> > 4ea8c1f73e3c
>> >> > --- /dev/null
>> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
>> >> > @@ -0,0 +1,25 @@
>> >> > +/* { dg-require-effective-target vect_int } */
>> >> > +
>> >> > +#include <stdint.h>
>> >> > +#include "tree-vect.h"
>> >> > +
>> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
>> >> > +
>> >> > +static __attribute__((__noinline__)) __attribute__((__noclone__))
>> >> > +V foo (V v, unsigned short i) {
>> >> > +  v /= i;
>> >> > +  return v;
>> >> > +}
>> >> > +
>> >> > +int
>> >> > +main (void)
>> >> > +{
>> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
>> >> > +}, 0xffff);
>> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
>> >> > +    if (v[i] != 0x00010001)
>> >> > +      __builtin_abort ();
>> >> > +  return 0;
>> >> > +}
>> >> > +
>> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
>> >> > +detected" "vect" { target aarch64*-*-* } } } */
>> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> >> > new file mode 100644
>> >> > index
>> >> >
>> >>
>> 0000000000000000000000000000000000000000..b4eb1a4dacba481e63
>> 06b4991
>> >> 4d2
>> >> > a29b933de625
>> >> > --- /dev/null
>> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
>> >> > @@ -0,0 +1,58 @@
>> >> > +/* { dg-require-effective-target vect_int } */
>> >> > +
>> >> > +#include <stdint.h>
>> >> > +#include <stdio.h>
>> >> > +#include "tree-vect.h"
>> >> > +
>> >> > +#define N 50
>> >> > +#define TYPE uint8_t
>> >> > +
>> >> > +#ifndef DEBUG
>> >> > +#define DEBUG 0
>> >> > +#endif
>> >> > +
>> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
>> >> > +
>> >> > +
>> >> > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
>> >> > +restrict pixel, TYPE level, int n) {
>> >> > +  for (int i = 0; i < n; i+=1)
>> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> >> > +
>> >> > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
>> >> > +restrict pixel, TYPE level, int n) {
>> >> > +  for (int i = 0; i < n; i+=1)
>> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
>> >> > +
>> >> > +int main ()
>> >> > +{
>> >> > +  TYPE a[N];
>> >> > +  TYPE b[N];
>> >> > +
>> >> > +  for (int i = 0; i < N; ++i)
>> >> > +    {
>> >> > +      a[i] = BASE + i * 13;
>> >> > +      b[i] = BASE + i * 13;
>> >> > +      if (DEBUG)
>> >> > +        printf ("%d: 0x%x\n", i, a[i]);
>> >> > +    }
>> >> > +
>> >> > +  fun1 (a, N / 2, N);
>> >> > +  fun2 (b, N / 2, N);
>> >> > +
>> >> > +  for (int i = 0; i < N; ++i)
>> >> > +    {
>> >> > +      if (DEBUG)
>> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
>> >> > +
>> >> > +      if (a[i] != b[i])
>> >> > +        __builtin_abort ();
>> >> > +    }
>> >> > +  return 0;
>> >> > +}
>> >> > +
>> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
>> >> > +{ target aarch64*-*-* } } } */
>> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
>> >> > index
>> >> >
>> >>
>> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c
>> 14077d
>> >> c3
>> >> > e970bed75ef6 100644
>> >> > --- a/gcc/tree-vect-generic.cc
>> >> > +++ b/gcc/tree-vect-generic.cc
>> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
>> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
>> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
>> >> >  	  tree ret;
>> >> >
>> >> > -	  /* Check if the target was going to handle it through the special
>> >> > -	     division callback hook.  */
>> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
>> >> > -	  if (cst &&
>> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL,
>> >> > -							  NULL_RTX,
>> >> NULL_RTX))
>> >> > -	    return NULL_TREE;
>> >> > -
>> >> > -
>> >> >  	  if (!optimize
>> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
>> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST diff --git
>> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
>> >> >
>> >>
>> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc46
>> 7f33
>> >> 69
>> >> > de2afea139d6 100644
>> >> > --- a/gcc/tree-vect-patterns.cc
>> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
>> *vinfo,
>> >> >        return pattern_stmt;
>> >> >      }
>> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
>> >> vectype,
>> >> > -							  wi::to_wide (cst),
>> >> > -							  NULL, NULL_RTX,
>> >> > -							  NULL_RTX))
>> >> > +	   && TYPE_UNSIGNED (itype)
>> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> > +	   && vectype
>> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >      {
>> >> > -      return NULL;
>> >> > +      /* Div optimizations using narrowings:
>> >> > +       we can do the division of e.g. shorts by 255 faster by calculating it as
>> >> > +       (x + ((x + 257) >> 8)) >> 8, assuming the operation is done in
>> >> > +       double the precision of x.
>> >> > +
>> >> > +       If we imagine a short as being composed of two blocks of bytes then
>> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> >> > +       adding 1 to each sub component:
>> >> > +
>> >> > +	    short value of 16-bits
>> >> > +       ┌──────────────┬────────────────┐
>> >> > +       │              │                │
>> >> > +       └──────────────┴────────────────┘
>> >> > +	 8-bit part1 ▲  8-bit part2   ▲
>> >> > +		     │                │
>> >> > +		     │                │
>> >> > +		    +1               +1
>> >> > +
>> >> > +       after the first addition, we have to shift right by 8, and narrow the
>> >> > +       results back to a byte.  Remember that the addition must be done in
>> >> > +       double the precision of the input.  However, if we know that the
>> >> > +       addition `x + 257` does not overflow then we can do the operation
>> >> > +       in the current precision, in which case we don't need the packs
>> >> > +       and unpacks.  */
>> >> > +      auto wcst = wi::to_wide (cst);
>> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> > +	{
>> >> > +	  wide_int min,max;
>> >> > +	  /* If we're in a pattern we need to find the original definition.  */
>> >> > +	  tree op0 = oprnd0;
>> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> > +	    {
>> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> > +	    }
>> >>
>> >> If this is generally safe (I'm skipping thinking about it in the
>> >> interests of a quick review :-)), then I think it should be done in
>> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> general than handling just assignments.
>> >>
>> >> > +
>> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> > +	     information we can't perform the optimization.  */
>> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> > +	    {
>> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> > +	      wi::overflow_type ovf;
>> >> > +	      /* We need adder and max in the same precision.  */
>> >> > +	      wide_int zadder
>> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> >> > +					  UNSIGNED);
>> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >>
>> >> Could you explain this a bit more?  When do we have mismatched
>> >> precisions?
>> >
>> > C promotion rules will promote e.g.
>> >
>> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >   for (int i = 0; i < n; i+=1)
>> >     pixel[i] = (pixel[i] + level) / 0xff; }
>> >
>> > And have the addition be done as a 32-bit integer.  The vectorizer
>> > will demote this down to a short, but range information is not stored
>> > for patterns.  So in the above the range will correctly be 0x1fe, but
>> > the precision will be that of the original expression, so 32.  This
>> > will be a mismatch with itype, which is derived from the size in which
>> > the vectorizer will perform the operation.
>> >
>> > Thanks,
>> > Tamar
>> >
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> > +	      if (ovf == wi::OVF_NONE)
>> >> > +		{
>> >> > +		  *type_out = vectype;
>> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
>> >> > +		  gcall *patt1
>> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
>> >> tadder);
>> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
>> >> > +		  gimple_call_set_lhs (patt1, lhs);
>> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
>> >> vectype);
>> >> > +
>> >> > +		  pattern_stmt
>> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
>> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
>> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
>> >> > +
>> >> > +		  return pattern_stmt;
>> >> > +		}
>> >> > +	    }
>> >> > +	}
>> >> >      }
>> >> >
>> >> >    if (prec > HOST_BITS_PER_WIDE_INT diff --git
>> >> > a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index
>> >> >
>> >>
>> eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0
>> b95
>> >> 64f
>> >> > c4e066e50081 100644
>> >> > --- a/gcc/tree-vect-stmts.cc
>> >> > +++ b/gcc/tree-vect-stmts.cc
>> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
>> >> >  	}
>> >> >        target_support_p = (optab_handler (optab, vec_mode)
>> >> >  			  != CODE_FOR_nothing);
>> >> > -      tree cst;
>> >> > -      if (!target_support_p
>> >> > -	  && op1
>> >> > -	  && (cst = uniform_integer_cst_p (op1)))
>> >> > -	target_support_p
>> >> > -	  = targetm.vectorize.can_special_div_by_const (code, vectype,
>> >> > -							wi::to_wide (cst),
>> >> > -							NULL, NULL_RTX,
>> >> > -							NULL_RTX);
>> >> >      }
>> >> >
>> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
>> >> > (vectype);
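
To make the regression discussed in this thread easier to follow, here is
a scalar sketch of the two shapes (illustrative only, not the actual
gimple): the widening-mul pass rewrites both uses of m = x * c as FMAs, so
the multiply stays on the critical path twice instead of being computed
once and fed into two highpart additions.

#include <stdint.h>

/* The FMA form: two fused multiply-adds, each followed by a shift,
   with the multiply recomputed in both steps.  */
static uint16_t
fma_form (uint16_t x, uint16_t c)
{
  uint16_t t = (uint16_t) ((uint32_t) x * c + 257) >> 8;  /* .FMA + shift */
  return (uint16_t) ((uint32_t) x * c + t) >> 8;          /* .FMA + shift */
}

/* The intended form: one multiply feeding two highpart additions
   (each add-then-shift pair maps onto a single ADDH/ADDHNB).  */
static uint16_t
addh_form (uint16_t x, uint16_t c)
{
  uint16_t m = (uint16_t) ((uint32_t) x * c);  /* single MUL */
  uint16_t t = (uint16_t) (m + 257) >> 8;      /* ADDH */
  return (uint16_t) (m + t) >> 8;              /* ADDH */
}

/* Both compute the same value mod 2^16; the difference is the longer
   dependency chain through the repeated multiply in fma_form.  */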
Tamar Christina Feb. 27, 2023, 12:14 p.m. UTC | #35
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, February 27, 2023 12:12 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi,
> >
> >> > I avoided open coding it with add and shift because it creates a
> >> > four-instruction dependency chain (with shifts, which are typically
> >> > slow) instead of a load and multiply.  This change, unless the target
> >> > is known to optimize it further, is unlikely to be beneficial.  And
> >> > by the time we get to costing, the only alternative is to undo the
> >> > existing pattern, and so you lose the general shift optimization.
> >> >
> >> > So it seemed unwise to open code it as shifts, given that the
> >> > codegen out of the vectorizer would be degenerate for most targets,
> >> > or one would need the more complicated route of costing during
> >> > pattern matching already.
> >>
> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> something that should be exposed through optabs.  And I imagine the
> >> open-coded version would still be better than nothing on targets without
> highpart multiply.
> >>
> >> So how about replacing the hook with one that simply asks whether
> >> division through highpart multiplication is preferred over the add/shift
> sequence?
> >> (Unfortunately it's not going to be possible to work that out from
> >> existing
> >> information.)
> >
> > So this doesn't work for SVE.  For SVE the multiplication widening
> > pass introduces FMAs at gimple level.  So in the cases where the
> > operation is fed from a widening multiplication we end up generating
> > an FMA.  If that were all, I could have matched the FMA.
> >
> > But it also pushes the multiplication into the second operand because
> > it no longer has a mul to share the results with.
> >
> > In any case, the gimple code is transformed into
> >
> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257,
> > ... });
> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> > vect_patt_65.12_128);
> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> > vect_patt_62.14_130;
> >
> > This transformation is much worse than the original code, it extended
> > the dependency chain with another expensive instruction. I can try to
> > correct this in RTL by matching FMA and shift and splitting into MUL +
> ADDHNB and hope CSE takes care of the extra mul.
> >
> > But this seems like a hack, and it's basically undoing the earlier
> > transformation.  It seems to me that the open coding is a bad idea.
> 
> Could you post the patch that gives this result?  I'll have a poke around.

Sure, I'll post the new series, it needs all of them.

Tamar.

> 
> Thanks,
> Richard
> 
> > Do you still want it Richard?
> >
> > Thanks,
> > Tamar
> >>
> >> Thanks,
> >> Richard
> >>
> >> >
> >> >>
> >> >> Some comments in addition to Richard's:
> >> >>
> >> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > Hi All,
> >> >> >
> >> >> > As discussed in the ticket, this replaces the approach for
> >> >> > optimizing the div by bitmask operation from a hook into optabs
> >> >> > implemented through add_highpart.
> >> >> >
> >> >> > In order to be able to use this we need to check whether the
> >> >> > current precision has enough bits to do the operation without
> >> >> > any of the additions
> >> >> overflowing.
> >> >> >
> >> >> > We use range information to determine this and only do the
> >> >> > operation if we're sure an overflow won't occur.
> >> >> >
> >> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going>
> >> >> issues.
> >> >> >
> >> >> > Ok for master?
> >> >> >
> >> >> > Thanks,
> >> >> > Tamar
> >> >> >
> >> >> > gcc/ChangeLog:
> >> >> >
> >> >> > 	PR target/108583
> >> >> > 	* doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST):
> >> >> Remove.
> >> >> > 	* doc/tm.texi.in: Likewise.
> >> >> > 	* explow.cc (round_push, align_dynamic_address): Revert
> >> >> > previous
> >> >> patch.
> >> >> > 	* expmed.cc (expand_divmod): Likewise.
> >> >> > 	* expmed.h (expand_divmod): Likewise.
> >> >> > 	* expr.cc (force_operand, expand_expr_divmod): Likewise.
> >> >> > 	* optabs.cc (expand_doubleword_mod,
> >> >> expand_doubleword_divmod): Likewise.
> >> >> > 	* internal-fn.def (ADDH): New.
> >> >> > 	* optabs.def (sadd_highpart_optab, uadd_highpart_optab): New.
> >> >> > 	* doc/md.texi: Document them.
> >> >> > 	* doc/rtl.texi: Likewise.
> >> >> > 	* target.def (can_special_div_by_const): Remove.
> >> >> > 	* target.h: Remove tree-core.h include
> >> >> > 	* targhooks.cc (default_can_special_div_by_const): Remove.
> >> >> > 	* targhooks.h (default_can_special_div_by_const): Remove.
> >> >> > 	* tree-vect-generic.cc (expand_vector_operation): Remove hook.
> >> >> > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove
> >> >> > hook
> >> >> and
> >> >> > 	implement new optab recognition based on range.
> >> >> > 	* tree-vect-stmts.cc (vectorizable_operation): Remove hook.
> >> >> >
> >> >> > gcc/testsuite/ChangeLog:
> >> >> >
> >> >> > 	PR target/108583
> >> >> > 	* gcc.dg/vect/vect-div-bitmask-4.c: New test.
> >> >> > 	* gcc.dg/vect/vect-div-bitmask-5.c: New test.
> >> >> >
> >> >> > --- inline copy of patch --
> >> >> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> >> >> >
> >> >>
> >>
> 7235d34c4b30949febfa10d5a626ac9358281cfa..02004c4b0f4d88dffe980f
> >> 74080
> >> >> 3
> >> >> > 8595e21af35d 100644
> >> >> > --- a/gcc/doc/md.texi
> >> >> > +++ b/gcc/doc/md.texi
> >> >> > @@ -5668,6 +5668,18 @@ represented in RTL using a
> >> >> @code{smul_highpart} RTX expression.
> >> >> >  Similar, but the multiplication is unsigned.  This may be
> >> >> > represented in RTL using an @code{umul_highpart} RTX expression.
> >> >> >
> >> >> > +@cindex @code{sadd@var{m}3_highpart} instruction pattern @item
> >> >> > +@samp{smul@var{m}3_highpart}
> >> >>
> >> >> sadd
> >> >>
> >> >> > +Perform a signed addition of operands 1 and 2, which have mode
> >> >> > +@var{m}, and store the most significant half of the product in
> >> >> > +operand
> >> 0.
> >> >> > +The least significant half of the product is discarded.  This
> >> >> > +may be represented in RTL using a @code{sadd_highpart} RTX
> expression.
> >> >> > +
> >> >> > +@cindex @code{uadd@var{m}3_highpart} instruction pattern @item
> >> >> > +@samp{uadd@var{m}3_highpart} Similar, but the addition is
> unsigned.
> >> >> > +This may be represented in RTL using an @code{uadd_highpart}
> >> >> > +RTX expression.
> >> >> > +
> >> >> >  @cindex @code{madd@var{m}@var{n}4} instruction pattern
> >> >> >  @item @samp{madd@var{m}@var{n}4}
> >> >> >  Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
> >> >> > diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> >> >> > index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
> >> >> > --- a/gcc/doc/rtl.texi
> >> >> > +++ b/gcc/doc/rtl.texi
> >> >> > @@ -2535,6 +2535,17 @@ out in machine mode @var{m}.
> >> >> > @code{smul_highpart} returns the high part  of a signed
> >> >> > multiplication, @code{umul_highpart} returns the high part  of
> >> >> > an unsigned
> >> >> multiplication.
> >> >> >
> >> >> > +@findex sadd_highpart
> >> >> > +@findex uadd_highpart
> >> >> > +@cindex high-part addition
> >> >> > +@cindex addition high part
> >> >> > +@item (sadd_highpart:@var{m} @var{x} @var{y}) @itemx
> >> >> > +(uadd_highpart:@var{m} @var{x} @var{y}) Represents the
> >> >> > +high-part addition of @var{x} and @var{y} carried out in machine
> mode @var{m}.
> >> >> > +@code{sadd_highpart} returns the high part of a signed
> >> >> > +addition, @code{uadd_highpart} returns the high part of an unsigned
> addition.
> >> >>
> >> >> The patch doesn't add these RTL codes though.
> >> >>
> >> >> > +
> >> >> >  @findex fma
> >> >> >  @cindex fused multiply-add
> >> >> >  @item (fma:@var{m} @var{x} @var{y} @var{z})
> >> >> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> >> >> > index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
> >> >> > --- a/gcc/doc/tm.texi
> >> >> > +++ b/gcc/doc/tm.texi
> >> >> > @@ -6137,22 +6137,6 @@ instruction pattern.  There is no need
> >> >> > for the hook to handle these two  implementation approaches itself.
> >> >> >  @end deftypefn
> >> >> >
> >> >> > -@deftypefn {Target Hook} bool
> >> >> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum
> >> >> @var{tree_code}, tree
> >> >> > @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx
> >> >> > @var{in0}, rtx @var{in1}) -This hook is used to test whether the
> >> >> > target has a special method of -division of vectors of type
> >> >> > @var{vectype}
> >> >> using the value @var{constant}, -and producing a vector of type
> >> >> @var{vectype}.  The division -will then not be decomposed by the
> >> >> vectorizer and kept as a div.
> >> >> > -
> >> >> > -When the hook is being used to test whether the target supports
> >> >> > a special -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> >> > When the hook -is being used to emit a division, @var{in0} and
> >> >> > @var{in1} are the source -vectors of type @var{vecttype} and
> >> >> > @var{output} is the destination vector of -type @var{vectype}.
> >> >> > -
> >> >> > -Return true if the operation is possible, emitting instructions
> >> >> > for it -if rtxes are provided and updating @var{output}.
> >> >> > -@end deftypefn
> >> >> > -
> >> >> >  @deftypefn {Target Hook} tree
> >> >> > TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned
> >> >> @var{code},
> >> >> > tree @var{vec_type_out}, tree @var{vec_type_in})  This hook
> >> >> > should return the decl of a function that implements the
> >> >> > vectorized variant of the function with the @code{combined_fn}
> >> >> > code
> >> >> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> >> >> > index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
> >> >> > --- a/gcc/doc/tm.texi.in
> >> >> > +++ b/gcc/doc/tm.texi.in
> >> >> > @@ -4173,8 +4173,6 @@ address;  but often a machine-dependent
> >> >> strategy can generate better code.
> >> >> >
> >> >> >  @hook TARGET_VECTORIZE_VEC_PERM_CONST
> >> >> >
> >> >> > -@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
> >> >> > -
> >> >> >  @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
> >> >> >
> >> >> >  @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
> >> >> > diff --git a/gcc/explow.cc b/gcc/explow.cc index
> >> >> >
> >> >>
> >>
> 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc
> >> 212f0
> >> >> bef
> >> >> > a016eea4573c 100644
> >> >> > --- a/gcc/explow.cc
> >> >> > +++ b/gcc/explow.cc
> >> >> > @@ -1037,7 +1037,7 @@ round_push (rtx size)
> >> >> >       TRUNC_DIV_EXPR.  */
> >> >> >    size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
> >> >> >  		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> >> > -  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> >> > size, align_rtx,
> >> >> > +  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size,
> >> >> > + align_rtx,
> >> >> >  			NULL_RTX, 1);
> >> >> >    size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
> >> >> >
> >> >> > @@ -1203,7 +1203,7 @@ align_dynamic_address (rtx target,
> >> >> > unsigned
> >> >> required_align)
> >> >> >  			 gen_int_mode (required_align /
> BITS_PER_UNIT - 1,
> >> >> >  				       Pmode),
> >> >> >  			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
> >> >> > -  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL,
> >> >> > target,
> >> >> > +  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
> >> >> >  			  gen_int_mode (required_align /
> BITS_PER_UNIT,
> >> >> >  					Pmode),
> >> >> >  			  NULL_RTX, 1);
> >> >> > diff --git a/gcc/expmed.h b/gcc/expmed.h index
> >> >> >
> >> >>
> >>
> 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c
> >> 5364
> >> >> 094
> >> >> > 1628068f3901 100644
> >> >> > --- a/gcc/expmed.h
> >> >> > +++ b/gcc/expmed.h
> >> >> > @@ -710,9 +710,8 @@ extern rtx expand_shift (enum tree_code,
> >> >> > machine_mode, rtx, poly_int64, rtx,  extern rtx
> >> >> > maybe_expand_shift
> >> >> (enum tree_code, machine_mode, rtx, int, rtx,
> >> >> >  			       int);
> >> >> >  #ifdef GCC_OPTABS_H
> >> >> > -extern rtx expand_divmod (int, enum tree_code, machine_mode,
> >> >> > tree,
> >> >> tree,
> >> >> > -			  rtx, rtx, rtx, int,
> >> >> > -			  enum optab_methods =
> OPTAB_LIB_WIDEN);
> >> >> > +extern rtx expand_divmod (int, enum tree_code, machine_mode,
> >> >> > +rtx,
> >> >> rtx,
> >> >> > +			  rtx, int, enum optab_methods =
> >> >> OPTAB_LIB_WIDEN);
> >> >> >  #endif
> >> >> >  #endif
> >> >> >
> >> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc index
> >> >> >
> >> >>
> >>
> 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025a
> >> b18a3
> >> >> a59
> >> >> > c169d3b7692f 100644
> >> >> > --- a/gcc/expmed.cc
> >> >> > +++ b/gcc/expmed.cc
> >> >> > @@ -4222,8 +4222,8 @@ expand_sdiv_pow2 (scalar_int_mode
> mode,
> >> rtx
> >> >> op0,
> >> >> > HOST_WIDE_INT d)
> >> >> >
> >> >> >  rtx
> >> >> >  expand_divmod (int rem_flag, enum tree_code code, machine_mode
> >> >> mode,
> >> >> > -	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
> >> >> > -	       int unsignedp, enum optab_methods methods)
> >> >> > +	       rtx op0, rtx op1, rtx target, int unsignedp,
> >> >> > +	       enum optab_methods methods)
> >> >> >  {
> >> >> >    machine_mode compute_mode;
> >> >> >    rtx tquotient;
> >> >> > @@ -4375,17 +4375,6 @@ expand_divmod (int rem_flag, enum
> >> tree_code
> >> >> > code, machine_mode mode,
> >> >> >
> >> >> >    last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) :
> >> >> > 0;
> >> >> >
> >> >> > -  /* Check if the target has specific expansions for the division.
> >> >> > */
> >> >> > -  tree cst;
> >> >> > -  if (treeop0
> >> >> > -      && treeop1
> >> >> > -      && (cst = uniform_integer_cst_p (treeop1))
> >> >> > -      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE
> >> >> (treeop0),
> >> >> > -						     wi::to_wide (cst),
> >> >> > -						     &target, op0, op1))
> >> >> > -    return target;
> >> >> > -
> >> >> > -
> >> >> >    /* Now convert to the best mode to use.  */
> >> >> >    if (compute_mode != mode)
> >> >> >      {
> >> >> > @@ -4629,8 +4618,8 @@ expand_divmod (int rem_flag, enum
> >> tree_code
> >> >> code, machine_mode mode,
> >> >> >  			    || (optab_handler (sdivmod_optab,
> int_mode)
> >> >> >  				!= CODE_FOR_nothing)))
> >> >> >  		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
> >> >> > -						int_mode, treeop0,
> treeop1,
> >> >> > -						op0, gen_int_mode
> (abs_d,
> >> >> > +						int_mode, op0,
> >> >> > +						gen_int_mode
> (abs_d,
> >> >> >  							      int_mode),
> >> >> >  						NULL_RTX, 0);
> >> >> >  		    else
> >> >> > @@ -4819,8 +4808,8 @@ expand_divmod (int rem_flag, enum
> >> tree_code
> >> >> code, machine_mode mode,
> >> >> >  				      size - 1, NULL_RTX, 0);
> >> >> >  		t3 = force_operand (gen_rtx_MINUS (int_mode, t1,
> nsign),
> >> >> >  				    NULL_RTX);
> >> >> > -		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> >> >> treeop0,
> >> >> > -				    treeop1, t3, op1, NULL_RTX, 0);
> >> >> > +		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode,
> t3,
> >> >> op1,
> >> >> > +				    NULL_RTX, 0);
> >> >> >  		if (t4)
> >> >> >  		  {
> >> >> >  		    rtx t5;
> >> >> > diff --git a/gcc/expr.cc b/gcc/expr.cc index
> >> >> >
> >> >>
> >>
> 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e75521633907
> >> 8d5b
> >> >> 2280
> >> >> > c6e277f26d72 100644
> >> >> > --- a/gcc/expr.cc
> >> >> > +++ b/gcc/expr.cc
> >> >> > @@ -8207,17 +8207,16 @@ force_operand (rtx value, rtx target)
> >> >> >  	    return expand_divmod (0,
> >> >> >  				  FLOAT_MODE_P (GET_MODE
> (value))
> >> >> >  				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
> >> >> > -				  GET_MODE (value), NULL, NULL,
> op1, op2,
> >> >> > -				  target, 0);
> >> >> > +				  GET_MODE (value), op1, op2, target,
> 0);
> >> >> >  	case MOD:
> >> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> NULL, NULL,
> >> >> > -				op1, op2, target, 0);
> >> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> op1, op2,
> >> >> > +				target, 0);
> >> >> >  	case UDIV:
> >> >> > -	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE
> (value),
> >> >> NULL, NULL,
> >> >> > -				op1, op2, target, 1);
> >> >> > +	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE
> (value),
> >> >> op1, op2,
> >> >> > +				target, 1);
> >> >> >  	case UMOD:
> >> >> > -	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> NULL, NULL,
> >> >> > -				op1, op2, target, 1);
> >> >> > +	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE
> (value),
> >> >> op1, op2,
> >> >> > +				target, 1);
> >> >> >  	case ASHIFTRT:
> >> >> >  	  return expand_simple_binop (GET_MODE (value), code, op1,
> op2,
> >> >> >  				      target, 0, OPTAB_LIB_WIDEN); @@
> -
> >> >> 9170,13 +9169,11 @@
> >> >> > expand_expr_divmod (tree_code code, machine_mode mode, tree
> >> >> treeop0,
> >> >> >        bool speed_p = optimize_insn_for_speed_p ();
> >> >> >        do_pending_stack_adjust ();
> >> >> >        start_sequence ();
> >> >> > -      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> >> > -				   op0, op1, target, 1);
> >> >> > +      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> >> > + target, 1);
> >> >> >        rtx_insn *uns_insns = get_insns ();
> >> >> >        end_sequence ();
> >> >> >        start_sequence ();
> >> >> > -      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0,
> treeop1,
> >> >> > -				   op0, op1, target, 0);
> >> >> > +      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1,
> >> >> > + target, 0);
> >> >> >        rtx_insn *sgn_insns = get_insns ();
> >> >> >        end_sequence ();
> >> >> >        unsigned uns_cost = seq_cost (uns_insns, speed_p); @@
> >> >> > -9198,8
> >> >> > +9195,7 @@ expand_expr_divmod (tree_code code, machine_mode
> >> >> mode, tree treeop0,
> >> >> >        emit_insn (sgn_insns);
> >> >> >        return sgn_ret;
> >> >> >      }
> >> >> > -  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
> >> >> > -			op0, op1, target, unsignedp);
> >> >> > +  return expand_divmod (mod_p, code, mode, op0, op1, target,
> >> >> > + unsignedp);
> >> >> >  }
> >> >> >
> >> >> >  rtx
> >> >> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >> >> >
> >> >>
> >>
> 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052
> >> 584f5a
> >> >> 3b
> >> >> > 8a734baa800f 100644
> >> >> > --- a/gcc/internal-fn.def
> >> >> > +++ b/gcc/internal-fn.def
> >> >> > @@ -174,6 +174,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN
> (AVG_CEIL,
> >> >> ECF_CONST
> >> >> > | ECF_NOTHROW, first,
> >> >> >
> >> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> >  			      smul_highpart, umul_highpart, binary)
> >> >> > +DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> > +			      sadd_highpart, uadd_highpart, binary)
> >> >> >  DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> >  			      smulhs, umulhs, binary)
> DEF_INTERNAL_SIGNED_OPTAB_FN
> >> >> > (MULHRS, ECF_CONST |
> >> >> ECF_NOTHROW, first,
> >> >> > diff --git a/gcc/optabs.cc b/gcc/optabs.cc index
> >> >> >
> >> >>
> >>
> cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe03
> >> 69a6
> >> >> e
> >> >> > 77082c1e617b 100644
> >> >> > --- a/gcc/optabs.cc
> >> >> > +++ b/gcc/optabs.cc
> >> >> > @@ -1106,9 +1106,8 @@ expand_doubleword_mod (machine_mode
> >> >> mode, rtx op0, rtx op1, bool unsignedp)
> >> >> >  		return NULL_RTX;
> >> >> >  	    }
> >> >> >  	}
> >> >> > -      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> word_mode,
> >> >> NULL, NULL,
> >> >> > -				     sum, gen_int_mode (INTVAL (op1),
> >> >> > -							word_mode),
> >> >> > +      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR,
> >> word_mode,
> >> >> sum,
> >> >> > +				     gen_int_mode (INTVAL (op1),
> >> >> word_mode),
> >> >> >  				     NULL_RTX, 1, OPTAB_DIRECT);
> >> >> >        if (remainder == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> > @@ -1211,8 +1210,8 @@ expand_doubleword_divmod
> (machine_mode
> >> >> mode, rtx
> >> >> > op0, rtx op1, rtx *rem,
> >> >> >
> >> >> >    if (op11 != const1_rtx)
> >> >> >      {
> >> >> > -      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL,
> >> NULL,
> >> >> quot1,
> >> >> > -				op11, NULL_RTX, unsignedp,
> >> >> OPTAB_DIRECT);
> >> >> > +      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1,
> >> >> op11,
> >> >> > +				NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> >> >> >        if (rem2 == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> >
> >> >> > @@ -1226,8 +1225,8 @@ expand_doubleword_divmod
> (machine_mode
> >> >> mode, rtx op0, rtx op1, rtx *rem,
> >> >> >        if (rem2 == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> >
> >> >> > -      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL,
> >> NULL,
> >> >> quot1,
> >> >> > -				 op11, NULL_RTX, unsignedp,
> >> >> OPTAB_DIRECT);
> >> >> > +      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode,
> >> >> > + quot1,
> >> op11,
> >> >> > +				 NULL_RTX, unsignedp,
> OPTAB_DIRECT);
> >> >> >        if (quot2 == NULL_RTX)
> >> >> >  	return NULL_RTX;
> >> >> >
> >> >> > diff --git a/gcc/optabs.def b/gcc/optabs.def index
> >> >> >
> >> >>
> >>
> 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2
> >> d7a5
> >> >> ccb
> >> >> > f6147947351a 100644
> >> >> > --- a/gcc/optabs.def
> >> >> > +++ b/gcc/optabs.def
> >> >> > @@ -265,6 +265,8 @@ OPTAB_D (spaceship_optab, "spaceship$a3")
> >> >> >
> >> >> >  OPTAB_D (smul_highpart_optab, "smul$a3_highpart")  OPTAB_D
> >> >> > (umul_highpart_optab, "umul$a3_highpart")
> >> >> > +OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart") OPTAB_D
> >> >> > +(uadd_highpart_optab, "uadd$a3_highpart")
> >> >> >
> >> >> >  OPTAB_D (cmpmem_optab, "cmpmem$a")
> >> >> >  OPTAB_D (cmpstr_optab, "cmpstr$a")
> >> >> > diff --git a/gcc/target.def b/gcc/target.def
> >> >> > index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
> >> >> > --- a/gcc/target.def
> >> >> > +++ b/gcc/target.def
> >> >> > @@ -1905,25 +1905,6 @@ implementation approaches itself.",
> >> >> >  	const vec_perm_indices &sel),
> >> >> >   NULL)
> >> >> >
> >> >> > -DEFHOOK
> >> >> > -(can_special_div_by_const,
> >> >> > - "This hook is used to test whether the target has a special
> >> >> > method of\n\ -division of vectors of type @var{vectype} using
> >> >> > the value @var{constant},\n\ -and producing a vector of type
> >> >> > @var{vectype}.  The division\n\ -will then not be decomposed by
> >> >> > the vectorizer and kept as a div.\n\ -\n\ -When the hook is
> >> >> > being used to test whether the target supports a special\n\
> >> >> > -divide, @var{in0}, @var{in1}, and @var{output} are all null.
> >> >> > When the hook\n\ -is being used to emit a division, @var{in0}
> >> >> > and @var{in1} are the source\n\ -vectors of type @var{vecttype}
> >> >> > and @var{output} is the destination vector of\n\ -type
> >> >> > @var{vectype}.\n\ -\n\ -Return true if the operation is
> >> >> > possible, emitting instructions for it\n\ -if rtxes are provided
> >> >> > and updating @var{output}.",
> >> >> > - bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
> >> >> > -	rtx in0, rtx in1),
> >> >> > - default_can_special_div_by_const)
> >> >> > -
> >> >> >  /* Return true if the target supports misaligned store/load of a
> >> >> >     specific factor denoted in the third parameter.  The last parameter
> >> >> >     is true if the access is defined in a packed struct.  */
> >> >> > diff --git a/gcc/target.h b/gcc/target.h index
> >> >> >
> >> >>
> >>
> 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fc
> >> a82b9
> >> >> 9f9
> >> >> > 13158c2d47b1 100644
> >> >> > --- a/gcc/target.h
> >> >> > +++ b/gcc/target.h
> >> >> > @@ -51,7 +51,6 @@
> >> >> >  #include "insn-codes.h"
> >> >> >  #include "tm.h"
> >> >> >  #include "hard-reg-set.h"
> >> >> > -#include "tree-core.h"
> >> >> >
> >> >> >  #if CHECKING_P
> >> >> >
> >> >> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h index
> >> >> >
> >> >>
> >>
> a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2
> >> 24454
> >> >> 93
> >> >> > 17a31390f0c2 100644
> >> >> > --- a/gcc/targhooks.h
> >> >> > +++ b/gcc/targhooks.h
> >> >> > @@ -209,8 +209,6 @@ extern void
> >> >> > default_addr_space_diagnose_usage (addr_space_t, location_t);
> >> >> > extern rtx default_addr_space_convert (rtx, tree, tree);  extern
> >> >> > unsigned int default_case_values_threshold (void);  extern bool
> >> >> > default_have_conditional_execution (void); -extern bool
> >> >> > default_can_special_div_by_const (enum tree_code, tree,
> >> >> wide_int,
> >> >> > -					      rtx *, rtx, rtx);
> >> >> >
> >> >> >  extern bool default_libc_has_function (enum function_class,
> >> >> > tree); extern bool default_libc_has_fast_function (int fcode);
> >> >> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc index
> >> >> >
> >> >>
> >>
> fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2
> >> da91e
> >> >> 03
> >> >> > 877337a931e7 100644
> >> >> > --- a/gcc/targhooks.cc
> >> >> > +++ b/gcc/targhooks.cc
> >> >> > @@ -1840,14 +1840,6 @@ default_have_conditional_execution
> (void)
> >> >> >    return HAVE_conditional_execution;  }
> >> >> >
> >> >> > -/* Default that no division by constant operations are special.
> >> >> > */ -bool -default_can_special_div_by_const (enum tree_code,
> >> >> > tree, wide_int, rtx *, rtx,
> >> >> > -				  rtx)
> >> >> > -{
> >> >> > -  return false;
> >> >> > -}
> >> >> > -
> >> >> >  /* By default we assume that c99 functions are present at the runtime,
> >> >> >     but sincos is not.  */
> >> >> >  bool
> >> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> >> > new file mode 100644
> >> >> > index
> >> >> >
> >> >>
> >>
> 0000000000000000000000000000000000000000..c81f8946922250234b
> >> f759e0a0
> >> >> a0
> >> >> > 4ea8c1f73e3c
> >> >> > --- /dev/null
> >> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
> >> >> > @@ -0,0 +1,25 @@
> >> >> > +/* { dg-require-effective-target vect_int } */
> >> >> > +
> >> >> > +#include <stdint.h>
> >> >> > +#include "tree-vect.h"
> >> >> > +
> >> >> > +typedef unsigned __attribute__((__vector_size__ (16))) V;
> >> >> > +
> >> >> > +static __attribute__((__noinline__))
> >> >> > +__attribute__((__noclone__)) V foo (V v, unsigned short i) {
> >> >> > +  v /= i;
> >> >> > +  return v;
> >> >> > +}
> >> >> > +
> >> >> > +int
> >> >> > +main (void)
> >> >> > +{
> >> >> > +  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff,
> >> >> > +0xffffffff }, 0xffff);
> >> >> > +  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
> >> >> > +    if (v[i] != 0x00010001)
> >> >> > +      __builtin_abort ();
> >> >> > +  return 0;
> >> >> > +}
> >> >> > +
> >> >> > +/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern:
> >> >> > +detected" "vect" { target aarch64*-*-* } } } */
> >> >> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> >> > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> >> > new file mode 100644
> >> >> > index
> >> >> >
> >> >>
> >>
> 0000000000000000000000000000000000000000..b4eb1a4dacba481e63
> >> 06b4991
> >> >> 4d2
> >> >> > a29b933de625
> >> >> > --- /dev/null
> >> >> > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
> >> >> > @@ -0,0 +1,58 @@
> >> >> > +/* { dg-require-effective-target vect_int } */
> >> >> > +
> >> >> > +#include <stdint.h>
> >> >> > +#include <stdio.h>
> >> >> > +#include "tree-vect.h"
> >> >> > +
> >> >> > +#define N 50
> >> >> > +#define TYPE uint8_t
> >> >> > +
> >> >> > +#ifndef DEBUG
> >> >> > +#define DEBUG 0
> >> >> > +#endif
> >> >> > +
> >> >> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> >> >> > +
> >> >> > +
> >> >> > +__attribute__((noipa, noinline, optimize("O1"))) void
> >> >> > +fun1(TYPE* restrict pixel, TYPE level, int n) {
> >> >> > +  for (int i = 0; i < n; i+=1)
> >> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> > +
> >> >> > +__attribute__((noipa, noinline, optimize("O3"))) void
> >> >> > +fun2(TYPE* restrict pixel, TYPE level, int n) {
> >> >> > +  for (int i = 0; i < n; i+=1)
> >> >> > +    pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> > +
> >> >> > +int main ()
> >> >> > +{
> >> >> > +  TYPE a[N];
> >> >> > +  TYPE b[N];
> >> >> > +
> >> >> > +  for (int i = 0; i < N; ++i)
> >> >> > +    {
> >> >> > +      a[i] = BASE + i * 13;
> >> >> > +      b[i] = BASE + i * 13;
> >> >> > +      if (DEBUG)
> >> >> > +        printf ("%d: 0x%x\n", i, a[i]);
> >> >> > +    }
> >> >> > +
> >> >> > +  fun1 (a, N / 2, N);
> >> >> > +  fun2 (b, N / 2, N);
> >> >> > +
> >> >> > +  for (int i = 0; i < N; ++i)
> >> >> > +    {
> >> >> > +      if (DEBUG)
> >> >> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> >> >> > +
> >> >> > +      if (a[i] != b[i])
> >> >> > +        __builtin_abort ();
> >> >> > +    }
> >> >> > +  return 0;
> >> >> > +}
> >> >> > +
> >> >> > +/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect"
> >> >> > +{ target aarch64*-*-* } } } */
> >> >> > diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
> >> >> > index
> >> >> >
> >> >>
> >>
> 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c
> >> 14077d
> >> >> c3
> >> >> > e970bed75ef6 100644
> >> >> > --- a/gcc/tree-vect-generic.cc
> >> >> > +++ b/gcc/tree-vect-generic.cc
> >> >> > @@ -1237,17 +1237,6 @@ expand_vector_operation
> >> >> (gimple_stmt_iterator *gsi, tree type, tree compute_type
> >> >> >  	  tree rhs2 = gimple_assign_rhs2 (assign);
> >> >> >  	  tree ret;
> >> >> >
> >> >> > -	  /* Check if the target was going to handle it through the
> special
> >> >> > -	     division callback hook.  */
> >> >> > -	  tree cst = uniform_integer_cst_p (rhs2);
> >> >> > -	  if (cst &&
> >> >> > -	      targetm.vectorize.can_special_div_by_const (code, type,
> >> >> > -							  wi::to_wide
> (cst),
> >> >> > -							  NULL,
> >> >> > -							  NULL_RTX,
> >> >> NULL_RTX))
> >> >> > -	    return NULL_TREE;
> >> >> > -
> >> >> > -
> >> >> >  	  if (!optimize
> >> >> >  	      || !VECTOR_INTEGER_TYPE_P (type)
> >> >> >  	      || TREE_CODE (rhs2) != VECTOR_CST
> >> >> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> >> >> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info
> >> *vinfo,
> >> >> >        return pattern_stmt;
> >> >> >      }
> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> vectype,
> >> >> > -							  wi::to_wide
> (cst),
> >> >> > -							  NULL,
> NULL_RTX,
> >> >> > -							  NULL_RTX))
> >> >> > +	   && TYPE_UNSIGNED (itype)
> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
> >> >> > +	   && vectype
> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> > +					      OPTIMIZE_FOR_SPEED))
> >> >> >      {
> >> >> > -      return NULL;
> >> >> > +      /* div optimizations using narrowings
> >> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> >> > +       double the precision of x.
> >> >> > +
> >> >> > +       If we imagine a short as being composed of two blocks of bytes
> then
> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> equivalent to
> >> >> > +       adding 1 to each sub component:
> >> >> > +
> >> >> > +	    short value of 16-bits
> >> >> > +       ┌──────────────┬────────────────┐
> >> >> > +       │              │                │
> >> >> > +       └──────────────┴────────────────┘
> >> >> > +	 8-bit part1 ▲  8-bit part2   ▲
> >> >> > +		     │                │
> >> >> > +		     │                │
> >> >> > +		    +1               +1
> >> >> > +
> >> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> >> > +       results back to a byte.  Remember that the addition must be done
> in
> >> >> > +       double the precision of the input.  However if we know
> >> >> > + that the
> >> >> addition
> >> >> > +       `x + 257` does not overflow then we can do the operation
> >> >> > + in the
> >> >> current
> >> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> > +	{
> >> >> > +	  wide_int min,max;
> >> >> > +	  /* If we're in a pattern we need to find the original definition.  */
> >> >> > +	  tree op0 = oprnd0;
> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
> >> >> > +	    {
> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT
> (stmt_info);
> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT
> (orig_stmt));
> >> >> > +	    }
> >> >>
> >> >> If this is generally safe (I'm skipping thinking about it in the
> >> >> interests of a quick review :-)), then I think it should be done
> >> >> in vect_get_range_info instead.  Using gimple_get_lhs would be
> >> >> more general than handling just assignments.
> >> >>
> >> >> > +
> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
> >> >> > +	     information we can't perform the optimization.  */
> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
> >> >> > +	    {
> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> > +	      wi::overflow_type ovf;
> >> >> > +	      /* We need adder and max in the same precision.  */
> >> >> > +	      wide_int zadder
> >> >> > +		= wide_int_storage::from (adder, wi::get_precision
> (max),
> >> >> > +					  UNSIGNED);
> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
> >> >>
> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> precisions?
> >> >
> >> > C promotion rules will promote e.g.
> >> >
> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >   for (int i = 0; i < n; i+=1)
> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >
> >> > And have the addition be done as a 32 bit integer.  The vectorizer
> >> > will demote this down to a short, but range information is not
> >> > stored for patterns.  So in the above the range will correctly be
> >> > 0x1fe but the precision will be that of the original expression, so
> >> > 32.  This will be a mismatch with itype which is derived from the
> >> > size the vectorizer
> >> will perform the operation in.
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> >>
> >> >> Thanks,
> >> >> Richard
> >> >>
> >> >> > +	      if (ovf == wi::OVF_NONE)
> >> >> > +		{
> >> >> > +		  *type_out = vectype;
> >> >> > +		  tree tadder = wide_int_to_tree (itype, adder);
> >> >> > +		  gcall *patt1
> >> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> >> >> tadder);
> >> >> > +		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> >> > +		  gimple_call_set_lhs (patt1, lhs);
> >> >> > +		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1,
> >> >> vectype);
> >> >> > +
> >> >> > +		  pattern_stmt
> >> >> > +		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0,
> lhs);
> >> >> > +		  lhs = vect_recog_temp_ssa_var (itype, NULL);
> >> >> > +		  gimple_call_set_lhs (pattern_stmt, lhs);
> >> >> > +
> >> >> > +		  return pattern_stmt;
> >> >> > +		}
> >> >> > +	    }
> >> >> > +	}
> >> >> >      }
> >> >> >
> >> >> >    if (prec > HOST_BITS_PER_WIDE_INT
> >> >> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >> >> > index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
> >> >> > --- a/gcc/tree-vect-stmts.cc
> >> >> > +++ b/gcc/tree-vect-stmts.cc
> >> >> > @@ -6263,15 +6263,6 @@ vectorizable_operation (vec_info *vinfo,
> >> >> >  	}
> >> >> >        target_support_p = (optab_handler (optab, vec_mode)
> >> >> >  			  != CODE_FOR_nothing);
> >> >> > -      tree cst;
> >> >> > -      if (!target_support_p
> >> >> > -	  && op1
> >> >> > -	  && (cst = uniform_integer_cst_p (op1)))
> >> >> > -	target_support_p
> >> >> > -	  = targetm.vectorize.can_special_div_by_const (code,
> vectype,
> >> >> > -							wi::to_wide
> (cst),
> >> >> > -							NULL,
> NULL_RTX,
> >> >> > -							NULL_RTX);
> >> >> >      }
> >> >> >
> >> >> >    bool using_emulated_vectors_p = vect_emulated_vector_p
> >> >> > (vectype);
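
As a quick cross-check of the identity the pattern in the quoted patch
relies on (for unsigned 16-bit x, x / 0xff == (x + ((x + 257) >> 8)) >> 8),
the following standalone program verifies it exhaustively, together with
the no-overflow condition that the wi::add range check above establishes.
It is illustrative only, not part of the patch:

#include <stdint.h>
#include <assert.h>

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    {
      uint32_t ref = x / 0xff;
      /* Done in double the precision, the identity always holds.  */
      assert (((x + ((x + 257) >> 8)) >> 8) == ref);
      /* Done in the original 16-bit precision it holds only when
	 max(x) + 257 cannot overflow, which is exactly what the
	 wi::add overflow check in the pattern tests.  */
      if (x + 257 <= 0xffff)
	{
	  uint16_t t1 = (uint16_t) (x + 257);
	  uint16_t t2 = (uint16_t) (x + (t1 >> 8));
	  assert ((uint16_t) (t2 >> 8) == ref);
	}
    }
  return 0;
}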
Richard Sandiford Feb. 27, 2023, 9:33 p.m. UTC | #36
Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, February 27, 2023 12:12 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> > Hi,
>> >
>> >> > I avoided open coding it with add and shift because it creates a 4
>> >> > instructions (and shifts which are typically slow) dependency chain
>> >> > instead of a load and multiply.  This change, unless the target is
>> >> > known to optimize it further is unlikely to be beneficial.  And by
>> >> > the time we get to costing the only alternative is to undo the
>> >> > existing pattern and
>> >> so you lose the general shift optimization.
>> >> >
>> >> > So it seemed unwise to open code as shifts, given the codegen out
>> >> > of the vectorizer would be degenerate for most targets or one needs
>> >> > the more complicated route of costing during pattern matching already.
>> >>
>> >> Hmm, OK.  That seems like a cost-model thing though, rather than
>> >> something that should be exposed through optabs.  And I imagine the
>> >> open-coded version would still be better than nothing on targets without
>> highpart multiply.
>> >>
>> >> So how about replacing the hook with one that simply asks whether
>> >> division through highpart multiplication is preferred over the add/shift
>> sequence?
>> >> (Unfortunately it's not going to be possible to work that out from
>> >> existing
>> >> information.)
>> >
>> > So this doesn't work for SVE.  For SVE the multiplication widening
>> > pass introduces FMAs at gimple level.  So in the cases where the
>> > operation is fed from a widening multiplication we end up generating FMA.
>> If that was it I could have matched FMA.
>> >
>> > But it also pushes the multiplication in the second operand because it
>> > no longer has a mul to share the results with.
>> >
>> > In any case, the gimple code is transformed into
>> >
>> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
>> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
>> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, { 257,
>> > ... });
>> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
>> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
>> > vect_patt_65.12_128);
>> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
>> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
>> > vect_patt_62.14_130;
>> >
>> > This transformation is much worse than the original code, it extended
>> > the dependency chain with another expensive instruction. I can try to
>> > correct this in RTL by matching FMA and shift and splitting into MUL +
>> ADDHNB and hope CSE takes care of the extra mul.
>> >
>> > But this seems like a hack, and it's basically undoing the earlier
>> > transformation.  It seems to me that the open coding is a bad idea.
>> 
>> Could you post the patch that gives this result?  I'll have a poke around.
>
> Sure, I'll post the new series, it needs all of them.

Thanks.  Which testcase did you use to get the above?

But since SVE does have highpart multiply, and since the assumption for
SVE is that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just
be one of the targets for which the hook that "asks whether division
through highpart multiplication is preferred over the add/shift
sequence" returns true?

For extra conservativeness, we could make the hook default to true
and explicitly return false for Advanced SIMD and for SVE2.

Richard
Tamar Christina Feb. 27, 2023, 10:10 p.m. UTC | #37
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, February 27, 2023 9:33 PM
> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, February 27, 2023 12:12 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> > Hi,
> >> >
> >> >> > I avoided open coding it with add and shift because it creates a
> >> >> > 4 instructions (and shifts which are typically slow) dependency
> >> >> > chain instead of a load and multiply.  This change, unless the
> >> >> > target is known to optimize it further is unlikely to be
> >> >> > beneficial.  And by the time we get to costing the only
> >> >> > alternative is to undo the existing pattern and
> >> >> so you lose the general shift optimization.
> >> >> >
> >> >> > So it seemed unwise to open code as shifts, given the codegen
> >> >> > out of the vectorizer would be degenerate for most targets or
> >> >> > one needs the more complicated route of costing during pattern
> matching already.
> >> >>
> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
> >> >> something that should be exposed through optabs.  And I imagine
> >> >> the open-coded version would still be better than nothing on
> >> >> targets without
> >> highpart multiply.
> >> >>
> >> >> So how about replacing the hook with one that simply asks whether
> >> >> division through highpart multiplication is preferred over the
> >> >> add/shift
> >> sequence?
> >> >> (Unfortunately it's not going to be possible to work that out from
> >> >> existing
> >> >> information.)
> >> >
> >> > So this doesn't work for SVE.  For SVE the multiplication widening
> >> > pass introduces FMAs at gimple level.  So in the cases where the
> >> > operation is fed from a widening multiplication we end up generating
> FMA.
> >> If that was it I could have matched FMA.
> >> >
> >> > But it also pushes the multiplication in the second operand because
> >> > it no longer has a mul to share the results with.
> >> >
> >> > In any case, the gimple code is transformed into
> >> >
> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
> >> > 257, ... });
> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> >> > vect_patt_65.12_128);
> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> >> > vect_patt_62.14_130;
> >> >
> >> > This transformation is much worse than the original code, it
> >> > extended the dependency chain with another expensive instruction. I
> >> > can try to correct this in RTL by matching FMA and shift and
> >> > splitting into MUL +
> >> ADDHNB and hope CSE takes care of the extra mul.
> >> >
> >> > But this seems like a hack, and it's basically undoing the earlier
> >> > transformation.  It seems to me that the open coding is a bad idea.
> >>
> >> Could you post the patch that gives this result?  I'll have a poke around.
> >
> > Sure, I'll post the new series, it needs all of them.
> 
> Thanks.  Which testcase did you use to get the above?
> 

#include <stdint.h>

#define N 16
#define TYPE uint8_t

void fun3(TYPE* restrict pixel, TYPE level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

> But since SVE does have highpart multiply, and since the assumption for SVE is
> that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just be one of
> the targets for which the hook that "asks whether division through highpart
> multiplication is preferred over the add/shift sequence" returns true?
> 

Yes (it's also two adds not 3), but it's not correct for SVE2, which has addhnb, in which case 2x addhnb is
much faster than MULH+shift.  And the problem is that widening_mul will not
allow add+shift to reach the backend because the ADD+shift were open coded.

They are now subjected to further optimization.

To summarize:

Other targets: false
SVE: false
SVE2: true
NEON: true

SVE2 borked because MUL+ADD+SHIFT -> FMA+SHIFT.

If you're saying you don't want the optimization for SVE2, then sure, happy to turn it off.

But  UMULH+LSR == 6 cycles on Neoverse-N2 and throughput of 1.
2x ADDHNB = 4 cycles and throughput of 2.

Tamar.

> 
> Richard
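
To make the comparison concrete, here are scalar models of the two
lowerings being weighed above.  This is illustrative only: the 0x8081
constant is the usual ceil(2^23 / 0xff) magic number, and the
narrowing-add version models just the arithmetic of the 2x ADDHNB
sequence, not the exact SVE2 operand layout:

#include <stdint.h>
#include <assert.h>

/* UMULH+LSR style: highpart multiply by a magic constant.  */
static uint16_t
div255_mulh (uint16_t x)
{
  return (uint16_t) (((uint32_t) x * 0x8081u) >> 23);
}

/* 2x ADDHNB style: two add-and-keep-the-high-half steps.  */
static uint16_t
div255_addhn (uint16_t x)
{
  uint16_t t = (uint16_t) (((uint32_t) x + 257) >> 8);
  return (uint16_t) (((uint32_t) x + t) >> 8);
}

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    {
      assert (div255_mulh ((uint16_t) x) == x / 0xff);
      assert (div255_addhn ((uint16_t) x) == x / 0xff);
    }
  return 0;
}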
Richard Sandiford Feb. 28, 2023, 11:08 a.m. UTC | #38
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, February 27, 2023 9:33 PM
>> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
>> rguenther@suse.de; jlaw@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>> 
>> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Monday, February 27, 2023 12:12 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> > Hi,
>> >> >
>> >> >> > I avoided open coding it with add and shift because it creates a
>> >> >> > 4 instructions (and shifts which are typically slow) dependency
>> >> >> > chain instead of a load and multiply.  This change, unless the
>> >> >> > target is known to optimize it further is unlikely to be
>> >> >> > beneficial.  And by the time we get to costing the only
>> >> >> > alternative is to undo the existing pattern and
>> >> >> so you lose the general shift optimization.
>> >> >> >
>> >> >> > So it seemed unwise to open code as shifts, given the codegen
>> >> >> > out of the vectorizer would be degenerate for most targets or
>> >> >> > one needs the more complicated route of costing during pattern
>> matching already.
>> >> >>
>> >> >> Hmm, OK.  That seems like a cost-model thing though, rather than
>> >> >> something that should be exposed through optabs.  And I imagine
>> >> >> the open-coded version would still be better than nothing on
>> >> >> targets without
>> >> highpart multiply.
>> >> >>
>> >> >> So how about replacing the hook with one that simply asks whether
>> >> >> division through highpart multiplication is preferred over the
>> >> >> add/shift
>> >> sequence?
>> >> >> (Unfortunately it's not going to be possible to work that out from
>> >> >> existing
>> >> >> information.)
>> >> >
>> >> > So this doesn't work for SVE.  For SVE the multiplication widening
>> >> > pass introduces FMAs at gimple level.  So in the cases where the
>> >> > operation is fed from a widening multiplication we end up generating
>> FMA.
>> >> If that was it I could have matched FMA.
>> >> >
>> >> > But it also pushes the multiplication in the second operand because
>> >> > it no longer has a mul to share the results with.
>> >> >
>> >> > In any case, the gimple code is transformed into
>> >> >
>> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
>> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short) vect__3.8_122;
>> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
>> >> > 257, ... });
>> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
>> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
>> >> > vect_patt_65.12_128);
>> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
>> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
>> >> > vect_patt_62.14_130;
>> >> >
>> >> > This transformation is much worse than the original code, it
>> >> > extended the dependency chain with another expensive instruction. I
>> >> > can try to correct this in RTL by matching FMA and shift and
>> >> > splitting into MUL +
>> >> ADDHNB and hope CSE takes care of the extra mul.
>> >> >
>> >> > But this seems like a hack, and it's basically undoing the earlier
>> >> > transformation.  It seems to me that the open coding is a bad idea.
>> >>
>> >> Could you post the patch that gives this result?  I'll have a poke around.
>> >
>> > Sure, I'll post the new series, it needs all of them.
>> 
>> Thanks.  Which testcase did you use to get the above?
>> 
>
> #include <stdint.h>
>
> #define N 16
> #define TYPE uint8_t
>
> void fun3(TYPE* restrict pixel, TYPE level, int n)
> {
>   for (int i = 0; i < (n & -16); i+=1)
>     pixel[i] = (pixel[i] * level) / 0xff;
> }

Thanks.  In that testcase, isn't the FMA handling an anti-optimisation
in its own right though?  It's duplicating a multiplication into two
points on a dependency chain.

E.g. for:

unsigned int
f1 (unsigned int a, unsigned int b, unsigned int c)
{
  unsigned int d = a * b;
  return d + ((c + d) >> 1);
}
unsigned int
g1 (unsigned int a, unsigned int b, unsigned int c)
{
  return a * b + c;
}

__Uint32x4_t
f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
{
  __Uint32x4_t d = a * b;
  return d + ((c + d) >> 1);
}
__Uint32x4_t
g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c)
{
  return a * b + c;
}

typedef unsigned int vec __attribute__((vector_size(32)));
vec
f3 (vec a, vec b, vec c)
{
  vec d = a * b;
  return d + ((c + d) >> 1);
}
vec
g3 (vec a, vec b, vec c)
{
  return a * b + c;
}

compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve,
all the g functions use multiply-add (as expected), but the
f functions are:

f1:
        mul     w1, w0, w1
        add     w0, w1, w2
        add     w0, w1, w0, lsr 1
        ret

f2:
        mul     v0.4s, v0.4s, v1.4s
        add     v2.4s, v0.4s, v2.4s
        usra    v0.4s, v2.4s, 1
        ret

f3:
        ...
        mla     z0.s, p0/m, z1.s, z2.s
        lsr     z0.s, z0.s, #1
        mad     z1.s, p0/m, z2.s, z0.s
        ...

What we do for f3 doesn't seem like a good idea.

I can see that duplicating an integer multiplication might make sense if
the integer FMAs are done in parallel.  But if one is a dependency of
the other, then at least for integer FMA, I think we should punt,
especially since we don't know what the target's late-forwarding
restrictions are.  I guess fp-contract comes into play for the
FP FMAs though.

>> But since SVE does have highpart multiply, and since the assumption for SVE is
>> that MULH+shift is better than ADD*3+shift*2, shouldn't SVE just be one of
>> the targets for which the hook that "asks whether division through highpart
>> multiplication is preferred over the add/shift sequence" returns true?
>> 
>
> Yes (it's also two adds not 3), but it's not correct for SVE2, which has addhnb, in which case 2x addhnb is
> much faster than MULH+shift.  And the problem is that widening_mul will not
> allow add+shift to reach the backend because the ADD+shift were open coded.
>
> They are now subjected to further optimization.
>
> To summarize:
>
> Other targets: false
> SVE: false
> SVE2: true
> NEON: true

Yeah, looks good.

> SVE2 borked because MUL+ADD+SHIFT -> FMA+SHIFT.
>
> If you're saying you don't want the optimization for SVE2, then sure, happy to turn it off.
>
> But  UMULH+LSR == 6 cycles on Neoverse-N2 and throughput of 1.
> 2x ADDHNB = 4 cycles and throughput of 2.

No, I meant the same as what you said in the summary above.

Richard
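
For reference, a hook of the kind agreed above might look roughly as
follows.  This is only a sketch: the hook name, documentation string and
aarch64 implementation are hypothetical, not taken from the posted
patches.

In target.def:

DEFHOOK
(preferred_div_as_shifts_over_mult,
 "Return true if dividing vector elements of type @var{type} via the\n\
add/shift sequence is preferred over a highpart multiply.",
 bool, (const_tree type),
 default_preferred_div_as_shifts_over_mult)

A possible aarch64 implementation encoding the table above (shifts win
for Advanced SIMD and SVE2, highpart multiply wins for plain SVE):

static bool
aarch64_preferred_div_as_shifts_over_mult (const_tree type)
{
  machine_mode mode = TYPE_MODE (type);
  if (!VECTOR_MODE_P (mode))
    return false;
  /* 2x ADDHNB beats UMULH+LSR on Advanced SIMD and SVE2; plain SVE
     should keep the highpart multiply.  */
  return !aarch64_sve_mode_p (mode) || TARGET_SVE2;
}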
Tamar Christina Feb. 28, 2023, 11:12 a.m. UTC | #39
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, February 28, 2023 11:09 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, February 27, 2023 9:33 PM
> >> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> >> rguenther@suse.de; jlaw@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Monday, February 27, 2023 12:12 PM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> > Hi,
> >> >> >
> >> >> >> > I avoided open coding it with add and shift because it
> >> >> >> > creates a
> >> >> >> > 4 instructions (and shifts which are typically slow)
> >> >> >> > dependency chain instead of a load and multiply.  This
> >> >> >> > change, unless the target is known to optimize it further is
> >> >> >> > unlikely to be beneficial.  And by the time we get to costing
> >> >> >> > the only alternative is to undo the existing pattern and
> >> >> >> so you lose the general shift optimization.
> >> >> >> >
> >> >> >> > So it seemed unwise to open code as shifts, given the codegen
> >> >> >> > out of the vectorizer would be degenerate for most targets or
> >> >> >> > one needs the more complicated route of costing during
> >> >> >> > pattern
> >> matching already.
> >> >> >>
> >> >> >> Hmm, OK.  That seems like a cost-model thing though, rather
> >> >> >> than something that should be exposed through optabs.  And I
> >> >> >> imagine the open-coded version would still be better than
> >> >> >> nothing on targets without
> >> >> highpart multiply.
> >> >> >>
> >> >> >> So how about replacing the hook with one that simply asks
> >> >> >> whether division through highpart multiplication is preferred
> >> >> >> over the add/shift
> >> >> sequence?
> >> >> >> (Unfortunately it's not going to be possible to work that out
> >> >> >> from existing
> >> >> >> information.)
> >> >> >
> >> >> > So this doesn't work for SVE.  For SVE the multiplication
> >> >> > widening pass introduces FMAs at gimple level.  So in the cases
> >> >> > where the operation is fed from a widening multiplication we end
> >> >> > up generating
> >> FMA.
> >> >> If that was it I could have matched FMA.
> >> >> >
> >> >> > But it also pushes the multiplication in the second operand
> >> >> > because it no longer has a mul to share the results with.
> >> >> >
> >> >> > In any case, the gimple code is transformed into
> >> >> >
> >> >> > vect__3.8_122 = .MASK_LOAD (_29, 8B, loop_mask_121);
> >> >> > vect_patt_57.9_123 = (vector([8,8]) unsigned short)
> >> >> > vect__3.8_122;
> >> >> > vect_patt_64.11_127 = .FMA (vect_patt_57.9_123, vect_cst__124, {
> >> >> > 257, ... });
> >> >> > vect_patt_65.12_128 = vect_patt_64.11_127 >> 8;
> >> >> > vect_patt_66.13_129 = .FMA (vect_patt_57.9_123, vect_cst__124,
> >> >> > vect_patt_65.12_128);
> >> >> > vect_patt_62.14_130 = vect_patt_66.13_129 >> 8;
> >> >> > vect_patt_68.15_131 = (vector([8,8]) unsigned charD.21)
> >> >> > vect_patt_62.14_130;
> >> >> >
> >> >> > This transformation is much worse than the original code, it
> >> >> > extended the dependency chain with another expensive
> >> >> > instruction. I can try to correct this in RTL by matching FMA
> >> >> > and shift and splitting into MUL +
> >> >> ADDHNB and hope CSE takes care of the extra mul.
> >> >> >
> >> >> > But this seems like a hack, and it's basically undoing the
> >> >> > earlier transformation.  It seems to me that the open coding is a bad
> idea.
> >> >>
> >> >> Could you post the patch that gives this result?  I'll have a poke around.
> >> >
> >> > Sure, I'll post the new series, it needs all of them.
> >>
> >> Thanks.  Which testcase did you use to get the above?
> >>
> >
> > #include <stdint.h>
> >
> > #define N 16
> > #define TYPE uint8_t
> >
> > void fun3(TYPE* restrict pixel, TYPE level, int n) {
> >   for (int i = 0; i < (n & -16); i+=1)
> >     pixel[i] = (pixel[i] * level) / 0xff; }
> 
> Thanks.  In that testcase, isn't the FMA handling an anti-optimisation in its
> own right though?  It's duplicating a multiplication into two points on a
> dependency chain.

Most definitely, that's what I meant above. The "optimization" doesn't take into
account the effect on the rest of the chain.

> 
> E.g. for:
> 
> unsigned int
> f1 (unsigned int a, unsigned int b, unsigned int c) {
>   unsigned int d = a * b;
>   return d + ((c + d) >> 1);
> }
> unsigned int
> g1 (unsigned int a, unsigned int b, unsigned int c) {
>   return a * b + c;
> }
> 
> __Uint32x4_t
> f2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
>   __Uint32x4_t d = a * b;
>   return d + ((c + d) >> 1);
> }
> __Uint32x4_t
> g2 (__Uint32x4_t a, __Uint32x4_t b, __Uint32x4_t c) {
>   return a * b + c;
> }
> 
> typedef unsigned int vec __attribute__((vector_size(32)));
> vec
> f3 (vec a, vec b, vec c)
> {
>   vec d = a * b;
>   return d + ((c + d) >> 1);
> }
> vec
> g3 (vec a, vec b, vec c)
> {
>   return a * b + c;
> }
> 
> compiled with -O2 -msve-vector-bits=256 -march=armv8.2-a+sve, all the g
> functions use multiply-add (as expected), but the f functions are:
> 
> f1:
>         mul     w1, w0, w1
>         add     w0, w1, w2
>         add     w0, w1, w0, lsr 1
>         ret
> 
> f2:
>         mul     v0.4s, v0.4s, v1.4s
>         add     v2.4s, v0.4s, v2.4s
>         usra    v0.4s, v2.4s, 1
>         ret
> 
> f3:
>         ...
>         mla     z0.s, p0/m, z1.s, z2.s
>         lsr     z0.s, z0.s, #1
>         mad     z1.s, p0/m, z2.s, z0.s
>         ...
> 
> What we do for f3 doesn't seem like a good idea.

Agreed,  I guess this means I have to fix that as well? ☹

I'll go take a look then..

Tamar.

> 
> I can see that duplicating an integer multiplication might make sense if the
> integer FMAs are done in parallel.  But if one is a dependency of the other,
> then at least for integer FMA, I think we should punt, especially since we don't
> know what the target's late-forwarding restrictions are.  I guess fp-contract
> comes into play for the FP FMAs though.
> 
> >> But since SVE does have highpart multiply, and since the assumption
> >> for SVE is that MULH+shift is better than ADD*3+shift*2, shouldn't
> >> SVE just be one of the targets for which the hook that "asks whether
> >> division through highpart multiplication is preferred over the add/shift
> >> sequence" returns true?
> >>
> >
> > Yes (it's also two adds not 3), but it's not correct for SVE2, which
> > has addhnb, in which case 2x addhnb is much faster than MULH+shift.
> > And the problem is that widening_mul will not allow add+shift to reach the
> > backend because the ADD+shift were open coded.
> >
> > They are now subjected to further optimization.
> >
> > To summarize:
> >
> > Other targets: false
> > SVE: false
> > SVE2: true
> > NEON: true
> 
> Yeah, looks good.
> 
> > SVE2 borked because MUL+ADD+SHIFT -> FMA+SHIFT.
> >
> > If you're saying you don't want the optimization for SVE2, then sure,
> > happy to turn it off.
> >
> > But  UMULH+LSR == 6 cycles on Neoverse-N2 and throughput of 1.
> > 2x ADDHNB = 4 cycles and throughput of 2.
> 
> No, I meant the same as what you said in the summary above.
> 
> Richard
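
As an aside on the arithmetic itself, the rewrite the series builds on,
v / 255 == (v + ((v + 257) >> 8)) >> 8 (see the comment in the
tree-vect-patterns.cc hunk of the patch below), is easy to check
exhaustively.  The small standalone program below is a sketch for
illustration, not part of the series; it confirms the identity for every
product of two bytes, where all intermediate sums fit comfortably in 32
bits:

#include <stdio.h>

int
main (void)
{
  /* v ranges over all products of two uint8_t values.  */
  for (unsigned v = 0; v <= 0xffu * 0xffu; v++)
    if (v / 255 != ((v + ((v + 257) >> 8)) >> 8))
      {
        printf ("identity fails for v = %u\n", v);
        return 1;
      }
  printf ("identity holds for all v in [0, 0xfe01]\n");
  return 0;
}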
Richard Sandiford Feb. 28, 2023, 12:03 p.m. UTC | #40
Tamar Christina <Tamar.Christina@arm.com> writes:
> [...]
>> What we do for f3 doesn't seem like a good idea.
>
> Agreed,  I guess this means I have to fix that as well? ☹
>
> I'll go take a look then..

How about something like this, before the main loop in
convert_mult_to_fma:

  /* There is no numerical difference between fused and unfused integer FMAs,
     and the assumption below that FMA is as cheap as addition is unlikely
     to be true, especially if the multiplication occurs multiple times on
     the same chain.  E.g., for something like:

         (((a * b) + c) >> 1) + (a * b)

     we do not want to duplicate the a * b into two additions, not least
     because the result is not a natural FMA chain.  */
  if (ANY_INTEGRAL_TYPE_P (type)
      && !has_single_use (mul_result))
    return false;

?  Richi, would that be OK with you?

From a quick check, it passes the aarch64-sve{,2}.exp tests.

Thanks,
Richard
Richard Biener March 1, 2023, 11:30 a.m. UTC | #41
On Tue, 28 Feb 2023, Richard Sandiford wrote:

> [...]
> 
> How about something like this, before the main loop in
> convert_mult_to_fma:
> 
>   /* There is no numerical difference between fused and unfused integer FMAs,
>      and the assumption below that FMA is as cheap as addition is unlikely
>      to be true, especially if the multiplication occurs multiple times on
>      the same chain.  E.g., for something like:
> 
>          (((a * b) + c) >> 1) + (a * b)
> 
>      we do not want to duplicate the a * b into two additions, not least
>      because the result is not a natural FMA chain.  */
>   if (ANY_INTEGRAL_TYPE_P (type)
>       && !has_single_use (mul_result))
>     return false;
> 
> ?  Richi, would that be OK with you?

Yes, I think that would be OK.  I would assume that integer FMA is
as cheap as multiplication (also for FP), but then the question is
how CPUs implement integer FMA, i.e. whether they split it into two
uops or not.

Richard.
Andrew Carlotti March 1, 2023, 4:57 p.m. UTC | #42
On Thu, Feb 23, 2023 at 11:39:51AM -0500, Andrew MacLeod via Gcc-patches wrote:
> 
> 
> Inheriting from operator_mult is also going to be hazardous because it also
> has an op1_range and op2_range...  you should at least define those and
> return VARYING to avoid other issues.  Same thing applies to widen_plus I
> think, and it has relation processing and other things as well.  Your widen
> operands are not what those classes expect, so I think you probably just
> want a fresh range operator.
> 
> It also looks like the mult operation is sign/zero extending both upper
> bounds, and neither lower bound...  I think that should be the LH upper and
> lower bound?
> 
> I've attached a second patch (newversion.patch) which incorporates my fix,
> the fix to the sign of only op1's bounds, as well as a simplification of
> the classes to not inherit from operator_mult/plus...  I think this still
> does what you want?  and it won't get you into unexpected trouble later :-)
> 
> let me know if this is still doing what you are expecting...
> 
> Andrew
> 

Hi,

This patch still uses the wrong signedness for some of the extensions in
WIDEN_MULT_EXPR.  It currently bases its promotion decisions on whether there
is any signed argument and whether the result is signed, i.e.:

(result, op1, op2)	Patch extends (op1, op2) as:
UUU			UU
UUS -> USU		(operands swapped)
USU			SU
USS			SU	wrong (should be SS)
SUU			US	wrong (should be UU)
SUS -> SSU		(operands swapped)
SSU			SS	wrong (should be SU)
SSS			SS

The documentation in tree.def is unclear about whether the output signedness is
linked to the input signedness, but at least the SSU case seems valid, and is
mishandled here.

I think it would be clearer and simpler to have four (or three) different
versions for each combination of signedness of the input operands.  This could
be implemented without extra code duplication by creating four different
instances of an operator_widen_mult class (perhaps extending a
range_operator_mixed_sign class), with the signedness indicated by two
additional class members.
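
That parameterization might look roughly as follows, written against the
range_operator interface used in the patch quoted below.  This is a sketch
of the suggestion rather than code from the thread, and the class and
instance names are mine:

class operator_widen_mult : public range_operator
{
public:
  operator_widen_mult (signop sign1, signop sign2)
    : m_sign1 (sign1), m_sign2 (sign2) {}
  virtual void wi_fold (irange &r, tree type,
                        const wide_int &lh_lb, const wide_int &lh_ub,
                        const wide_int &rh_lb, const wide_int &rh_ub) const
  {
    /* Extend each operand's bounds with that operand's own sign, then
       defer to the ordinary multiplication folder as the patch does.  */
    wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, m_sign1);
    wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, m_sign1);
    wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, m_sign2);
    wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, m_sign2);
    op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
  }
private:
  signop m_sign1, m_sign2;
};

static operator_widen_mult op_widen_mult_uu (UNSIGNED, UNSIGNED);
static operator_widen_mult op_widen_mult_us (UNSIGNED, SIGNED);
static operator_widen_mult op_widen_mult_su (SIGNED, UNSIGNED);
static operator_widen_mult op_widen_mult_ss (SIGNED, SIGNED);

maybe_non_standard would then simply pick the instance matching the actual
operand signs instead of swapping and approximating.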

The documentation for WIDEN_PLUS_EXPR (and several other expressions added in
the same commit) is completely missing.  If the signs are required to match,
then this should be clarified; otherwise it would need the same special
handling as WIDEN_MULT_EXPR.

Andrew

> diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc
> index d9dfdc56939..824e0338f34 100644
> --- a/gcc/gimple-range-op.cc
> +++ b/gcc/gimple-range-op.cc
> @@ -179,6 +179,8 @@ gimple_range_op_handler::gimple_range_op_handler (gimple *s)
>    // statements.
>    if (is_a <gcall *> (m_stmt))
>      maybe_builtin_call ();
> +  else
> +    maybe_non_standard ();
>  }
>  
>  // Calculate what we can determine of the range of this unary
> @@ -764,6 +766,36 @@ public:
>    }
>  } op_cfn_parity;
>  
> +// Set up a gimple_range_op_handler for any nonstandard function which can be
> +// supported via range-ops.
> +
> +void
> +gimple_range_op_handler::maybe_non_standard ()
> +{
> +  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
> +    switch (gimple_assign_rhs_code (m_stmt))
> +      {
> +	case WIDEN_MULT_EXPR:
> +	{
> +	  m_valid = true;
> +	  m_op1 = gimple_assign_rhs1 (m_stmt);
> +	  m_op2 = gimple_assign_rhs2 (m_stmt);
> +	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
> +	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
> +	  if (signed2 && !signed1)
> +	    std::swap (m_op1, m_op2);
> +
> +	  if (signed1 || signed2)
> +	    m_int = ptr_op_widen_mult_signed;
> +	  else
> +	    m_int = ptr_op_widen_mult_unsigned;
> +	  break;
> +	}
> +	default:
> +	  break;
> +      }
> +}
> +
>  // Set up a gimple_range_op_handler for any built in function which can be
>  // supported via range-ops.
>  
> diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h
> index 743b858126e..1bf63c5ce6f 100644
> --- a/gcc/gimple-range-op.h
> +++ b/gcc/gimple-range-op.h
> @@ -41,6 +41,7 @@ public:
>  		 relation_trio = TRIO_VARYING);
>  private:
>    void maybe_builtin_call ();
> +  void maybe_non_standard ();
>    gimple *m_stmt;
>    tree m_op1, m_op2;
>  };
> diff --git a/gcc/range-op.cc b/gcc/range-op.cc
> index 5c67bce6d3a..7cd19a92d00 100644
> --- a/gcc/range-op.cc
> +++ b/gcc/range-op.cc
> @@ -1556,6 +1556,34 @@ operator_plus::op2_range (irange &r, tree type,
>    return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());
>  }
>  
> +class operator_widen_plus : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb,
> +			const wide_int &lh_ub,
> +			const wide_int &rh_lb,
> +			const wide_int &rh_ub) const;
> +} op_widen_plus;
> +
> +void
> +operator_widen_plus::wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb, const wide_int &lh_ub,
> +			const wide_int &rh_lb, const wide_int &rh_ub) const
> +{
> +   wi::overflow_type ov_lb, ov_ub;
> +   signop s = TYPE_SIGN (type);
> +
> +   wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
> +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +   wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
> +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> +
> +   r = int_range<2> (type, new_lb, new_ub);
> +}
>  
>  class operator_minus : public range_operator
>  {
> @@ -2031,6 +2059,70 @@ operator_mult::wi_fold (irange &r, tree type,
>      }
>  }
>  
> +class operator_widen_mult_signed : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb,
> +			const wide_int &lh_ub,
> +			const wide_int &rh_lb,
> +			const wide_int &rh_ub)
> +    const;
> +} op_widen_mult_signed;
> +range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
> +
> +void
> +operator_widen_mult_signed::wi_fold (irange &r, tree type,
> +				     const wide_int &lh_lb,
> +				     const wide_int &lh_ub,
> +				     const wide_int &rh_lb,
> +				     const wide_int &rh_ub) const
> +{
> +  signop s = TYPE_SIGN (type);
> +
> +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, SIGNED);
> +  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, SIGNED);
> +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +  /* We don't expect a widening multiplication to be able to overflow but range
> +     calculations for multiplications are complicated.  After widening the
> +     operands lets call the base class.  */
> +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> +}
> +
> +
> +class operator_widen_mult_unsigned : public range_operator
> +{
> +public:
> +  virtual void wi_fold (irange &r, tree type,
> +			const wide_int &lh_lb,
> +			const wide_int &lh_ub,
> +			const wide_int &rh_lb,
> +			const wide_int &rh_ub)
> +    const;
> +} op_widen_mult_unsigned;
> +range_operator *ptr_op_widen_mult_unsigned = &op_widen_mult_unsigned;
> +
> +void
> +operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
> +				       const wide_int &lh_lb,
> +				       const wide_int &lh_ub,
> +				       const wide_int &rh_lb,
> +				       const wide_int &rh_ub) const
> +{
> +  signop s = TYPE_SIGN (type);
> +
> +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, UNSIGNED);
> +  wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, UNSIGNED);
> +  wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> +  wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> +
> +  /* We don't expect a widening multiplication to be able to overflow but range
> +     calculations for multiplications are complicated.  After widening the
> +     operands lets call the base class.  */
> +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub);
> +}
>  
>  class operator_div : public cross_product_operator
>  {
> @@ -4473,6 +4565,7 @@ integral_table::integral_table ()
>    set (GT_EXPR, op_gt);
>    set (GE_EXPR, op_ge);
>    set (PLUS_EXPR, op_plus);
> +  set (WIDEN_PLUS_EXPR, op_widen_plus);
>    set (MINUS_EXPR, op_minus);
>    set (MIN_EXPR, op_min);
>    set (MAX_EXPR, op_max);
> diff --git a/gcc/range-op.h b/gcc/range-op.h
> index f00b747f08a..5fe463234ae 100644
> --- a/gcc/range-op.h
> +++ b/gcc/range-op.h
> @@ -311,4 +311,6 @@ private:
>  // This holds the range op table for floating point operations.
>  extern floating_op_table *floating_tree_table;
>  
> +extern range_operator *ptr_op_widen_mult_signed;
> +extern range_operator *ptr_op_widen_mult_unsigned;
>  #endif // GCC_RANGE_OP_H
Tamar Christina March 1, 2023, 6:16 p.m. UTC | #43
> -----Original Message-----
> From: Andrew Carlotti <Andrew.Carlotti@arm.com>
> Sent: Wednesday, March 1, 2023 4:58 PM
> To: Andrew MacLeod <amacleod@redhat.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; Richard Biener
> <rguenther@suse.de>; Richard Sandiford <Richard.Sandiford@arm.com>;
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> On Thu, Feb 23, 2023 at 11:39:51AM -0500, Andrew MacLeod via Gcc-
> patches wrote:
> > [...]
> 
> Hi,
> 
> This patch still uses the wrong signedness for some of the extensions in
> WIDEN_MULT_EXPR.  It currently bases its promotion decisions on whether
> there is any signed argument and whether the result is signed, i.e.:
> 
> (result, op1, op2)	Patch extends (op1, op2) as:
> UUU			UU
> UUS -> USU		(operands swapped)
> USU			SU
> USS			SU	wrong (should be SS)
> SUU			US	wrong (should be UU)
> SUS -> SSU		(operands swapped)
> SSU			SS	wrong (should be SU)
> SSS			SS
> 
> The documentation in tree.def is unclear about whether the output
> signedness is linked to the input signedness, but at least the SSU case seems
> valid, and is mishandled here.

Hi,

Thanks for the concern, but I don't think those "wrong" cases are valid.
There's only one explicit carve-out for this mismatch that I'm aware of, which
is for constants that fit in the source type; convert_mult_to_widen doesn't
accept them otherwise.

For every other mismatched sign it will fold an explicit convert into the sequence
to ensure all three types match.

For example:

long unsigned d1(int x, int y)
{
    return (long unsigned)x * y;
}

Requires a cast.

long unsigned d1(int x, int y)
{
    return (long unsigned)x * 4;
}

Does not, and

long unsigned d1(int x, int y)
{
    return (long unsigned)x * -4;
}

Does not fit and so is not accepted.  The reason it can happen is that the unsigned
cast on a positive constant is discarded.

Furthermore, the operation that introduces this widening only looks at the
sign of the leftmost operand and that of the result.

So this correctly handles the normal cases, as well as the abnormal ones the
compiler introduces as specific optimizations.

Tamar.


> 
> I think it would be clearer and simpler to have four (or three) different versions
> for each combnation of signedness of the input operands. This could be
> implemented without extra code duplication by creating four different
> instances of an operator_widen_mult class (perhaps extending a
> range_operator_mixed_sign class), with the signedness indicated by two
> additional class members.
> 
> The documentation for WIDEN_PLUS_EXPR (and several other expressions
> added in the same commit) is completely missing. If the signs are required to
> be matching, then this should be clarified; otherwise it would need the same
> special handling as WIDEN_MULT_EXPR.
> 
> Andrew
> 
> > diff --git a/gcc/gimple-range-op.cc b/gcc/gimple-range-op.cc index
> > d9dfdc56939..824e0338f34 100644
> > --- a/gcc/gimple-range-op.cc
> > +++ b/gcc/gimple-range-op.cc
> > @@ -179,6 +179,8 @@
> gimple_range_op_handler::gimple_range_op_handler (gimple *s)
> >    // statements.
> >    if (is_a <gcall *> (m_stmt))
> >      maybe_builtin_call ();
> > +  else
> > +    maybe_non_standard ();
> >  }
> >
> >  // Calculate what we can determine of the range of this unary @@
> > -764,6 +766,36 @@ public:
> >    }
> >  } op_cfn_parity;
> >
> > +// Set up a gimple_range_op_handler for any nonstandard function
> > +which can be // supported via range-ops.
> > +
> > +void
> > +gimple_range_op_handler::maybe_non_standard () {
> > +  if (gimple_code (m_stmt) == GIMPLE_ASSIGN)
> > +    switch (gimple_assign_rhs_code (m_stmt))
> > +      {
> > +	case WIDEN_MULT_EXPR:
> > +	{
> > +	  m_valid = true;
> > +	  m_op1 = gimple_assign_rhs1 (m_stmt);
> > +	  m_op2 = gimple_assign_rhs2 (m_stmt);
> > +	  bool signed1 = TYPE_SIGN (TREE_TYPE (m_op1)) == SIGNED;
> > +	  bool signed2 = TYPE_SIGN (TREE_TYPE (m_op2)) == SIGNED;
> > +	  if (signed2 && !signed1)
> > +	    std::swap (m_op1, m_op2);
> > +
> > +	  if (signed1 || signed2)
> > +	    m_int = ptr_op_widen_mult_signed;
> > +	  else
> > +	    m_int = ptr_op_widen_mult_unsigned;
> > +	  break;
> > +	}
> > +	default:
> > +	  break;
> > +      }
> > +}
> > +
> >  // Set up a gimple_range_op_handler for any built in function which
> > can be  // supported via range-ops.
> >
> > diff --git a/gcc/gimple-range-op.h b/gcc/gimple-range-op.h index
> > 743b858126e..1bf63c5ce6f 100644
> > --- a/gcc/gimple-range-op.h
> > +++ b/gcc/gimple-range-op.h
> > @@ -41,6 +41,7 @@ public:
> >  		 relation_trio = TRIO_VARYING);
> >  private:
> >    void maybe_builtin_call ();
> > +  void maybe_non_standard ();
> >    gimple *m_stmt;
> >    tree m_op1, m_op2;
> >  };
> > diff --git a/gcc/range-op.cc b/gcc/range-op.cc index
> > 5c67bce6d3a..7cd19a92d00 100644
> > --- a/gcc/range-op.cc
> > +++ b/gcc/range-op.cc
> > @@ -1556,6 +1556,34 @@ operator_plus::op2_range (irange &r, tree type,
> >    return op1_range (r, type, lhs, op1, rel.swap_op1_op2 ());  }
> >
> > +class operator_widen_plus : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb,
> > +			const wide_int &lh_ub,
> > +			const wide_int &rh_lb,
> > +			const wide_int &rh_ub) const;
> > +} op_widen_plus;
> > +
> > +void
> > +operator_widen_plus::wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb, const wide_int &lh_ub,
> > +			const wide_int &rh_lb, const wide_int &rh_ub) const {
> > +   wi::overflow_type ov_lb, ov_ub;
> > +   signop s = TYPE_SIGN (type);
> > +
> > +   wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb) * 2, s);
> > +   wide_int rh_wlb = wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);
> > +   wide_int lh_wub = wide_int::from (lh_ub, wi::get_precision (lh_ub) * 2, s);
> > +   wide_int rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub)
> > + * 2, s);
> > +
> > +   wide_int new_lb = wi::add (lh_wlb, rh_wlb, s, &ov_lb);
> > +   wide_int new_ub = wi::add (lh_wub, rh_wub, s, &ov_ub);
> > +
> > +   r = int_range<2> (type, new_lb, new_ub); }
> >
> >  class operator_minus : public range_operator  { @@ -2031,6 +2059,70
> > @@ operator_mult::wi_fold (irange &r, tree type,
> >      }
> >  }
> >
> > +class operator_widen_mult_signed : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb,
> > +			const wide_int &lh_ub,
> > +			const wide_int &rh_lb,
> > +			const wide_int &rh_ub)
> > +    const;
> > +} op_widen_mult_signed;
> > +range_operator *ptr_op_widen_mult_signed = &op_widen_mult_signed;
> > +
> > +void
> > +operator_widen_mult_signed::wi_fold (irange &r, tree type,
> > +				     const wide_int &lh_lb,
> > +				     const wide_int &lh_ub,
> > +				     const wide_int &rh_lb,
> > +				     const wide_int &rh_ub) const {
> > +  signop s = TYPE_SIGN (type);
> > +
> > +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb)
> > + * 2, SIGNED);  wide_int lh_wub = wide_int::from (lh_ub,
> > + wi::get_precision (lh_ub) * 2, SIGNED);  wide_int rh_wlb =
> > + wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);  wide_int
> > + rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +  /* We don't expect a widening multiplication to be able to overflow but
> range
> > +     calculations for multiplications are complicated.  After widening the
> > +     operands lets call the base class.  */
> > +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub); }
> > +
> > +
> > +class operator_widen_mult_unsigned : public range_operator {
> > +public:
> > +  virtual void wi_fold (irange &r, tree type,
> > +			const wide_int &lh_lb,
> > +			const wide_int &lh_ub,
> > +			const wide_int &rh_lb,
> > +			const wide_int &rh_ub)
> > +    const;
> > +} op_widen_mult_unsigned;
> > +range_operator *ptr_op_widen_mult_unsigned =
> &op_widen_mult_unsigned;
> > +
> > +void
> > +operator_widen_mult_unsigned::wi_fold (irange &r, tree type,
> > +				       const wide_int &lh_lb,
> > +				       const wide_int &lh_ub,
> > +				       const wide_int &rh_lb,
> > +				       const wide_int &rh_ub) const {
> > +  signop s = TYPE_SIGN (type);
> > +
> > +  wide_int lh_wlb = wide_int::from (lh_lb, wi::get_precision (lh_lb)
> > + * 2, UNSIGNED);  wide_int lh_wub = wide_int::from (lh_ub,
> > + wi::get_precision (lh_ub) * 2, UNSIGNED);  wide_int rh_wlb =
> > + wide_int::from (rh_lb, wi::get_precision (rh_lb) * 2, s);  wide_int
> > + rh_wub = wide_int::from (rh_ub, wi::get_precision (rh_ub) * 2, s);
> > +
> > +  /* We don't expect a widening multiplication to be able to overflow but
> range
> > +     calculations for multiplications are complicated.  After widening the
> > +     operands lets call the base class.  */
> > +  return op_mult.wi_fold (r, type, lh_wlb, lh_wub, rh_wlb, rh_wub); }
> >
> >  class operator_div : public cross_product_operator  { @@ -4473,6
> > +4565,7 @@ integral_table::integral_table ()
> >    set (GT_EXPR, op_gt);
> >    set (GE_EXPR, op_ge);
> >    set (PLUS_EXPR, op_plus);
> > +  set (WIDEN_PLUS_EXPR, op_widen_plus);
> >    set (MINUS_EXPR, op_minus);
> >    set (MIN_EXPR, op_min);
> >    set (MAX_EXPR, op_max);
> > diff --git a/gcc/range-op.h b/gcc/range-op.h index
> > f00b747f08a..5fe463234ae 100644
> > --- a/gcc/range-op.h
> > +++ b/gcc/range-op.h
> > @@ -311,4 +311,6 @@ private:
> >  // This holds the range op table for floating point operations.
> >  extern floating_op_table *floating_tree_table;
> >
> > +extern range_operator *ptr_op_widen_mult_signed; extern
> > +range_operator *ptr_op_widen_mult_unsigned;
> >  #endif // GCC_RANGE_OP_H
diff mbox series

Patch

--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5668,6 +5668,18 @@  represented in RTL using a @code{smul_highpart} RTX expression.
 Similar, but the multiplication is unsigned.  This may be represented
 in RTL using an @code{umul_highpart} RTX expression.
 
+@cindex @code{sadd@var{m}3_highpart} instruction pattern
+@item @samp{sadd@var{m}3_highpart}
+Perform a signed addition of operands 1 and 2, which have mode
+@var{m}, and store the most significant half of the sum in operand 0.
+The least significant half of the sum is discarded.  This may be
+represented in RTL using a @code{sadd_highpart} RTX expression.
+
+@cindex @code{uadd@var{m}3_highpart} instruction pattern
+@item @samp{uadd@var{m}3_highpart}
+Similar, but the addition is unsigned.  This may be represented
+in RTL using an @code{uadd_highpart} RTX expression.
+
 @cindex @code{madd@var{m}@var{n}4} instruction pattern
 @item @samp{madd@var{m}@var{n}4}
 Multiply operands 1 and 2, sign-extend them to mode @var{n}, add
diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index d1380e1eb3ba6b2853686f41f2bf937bfcbed1fe..63a7ef6e566eeea4f14c00343d171940ec4222f3 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -2535,6 +2535,17 @@  out in machine mode @var{m}.  @code{smul_highpart} returns the high part
 of a signed multiplication, @code{umul_highpart} returns the high part
 of an unsigned multiplication.
 
+@findex sadd_highpart
+@findex uadd_highpart
+@cindex high-part addition
+@cindex addition high part
+@item (sadd_highpart:@var{m} @var{x} @var{y})
+@itemx (uadd_highpart:@var{m} @var{x} @var{y})
+Represents the high-part addition of @var{x} and @var{y} carried
+out in machine mode @var{m}.  @code{sadd_highpart} returns the high part
+of a signed addition, @code{uadd_highpart} returns the high part
+of an unsigned addition.
+
 @findex fma
 @cindex fused multiply-add
 @item (fma:@var{m} @var{x} @var{y} @var{z})
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c6c891972d1e58cd163b259ba96a599d62326865..3ab2031a336b8758d5791484017e6b0d62ab077e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6137,22 +6137,6 @@  instruction pattern.  There is no need for the hook to handle these two
 implementation approaches itself.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST (enum @var{tree_code}, tree @var{vectype}, wide_int @var{constant}, rtx *@var{output}, rtx @var{in0}, rtx @var{in1})
-This hook is used to test whether the target has a special method of
-division of vectors of type @var{vectype} using the value @var{constant},
-and producing a vector of type @var{vectype}.  The division
-will then not be decomposed by the vectorizer and kept as a div.
-
-When the hook is being used to test whether the target supports a special
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook
-is being used to emit a division, @var{in0} and @var{in1} are the source
-vectors of type @var{vecttype} and @var{output} is the destination vector of
-type @var{vectype}.
-
-Return true if the operation is possible, emitting instructions for it
-if rtxes are provided and updating @var{output}.
-@end deftypefn
-
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION (unsigned @var{code}, tree @var{vec_type_out}, tree @var{vec_type_in})
 This hook should return the decl of a function that implements the
 vectorized variant of the function with the @code{combined_fn} code
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 613b2534149415f442163d599503efaf423b673b..8790f4e44b98b51ad5d1efec0a3abccd1c293c7b 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4173,8 +4173,6 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_VEC_PERM_CONST
 
-@hook TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
-
 @hook TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION
 
 @hook TARGET_VECTORIZE_BUILTIN_MD_VECTORIZED_FUNCTION
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 83439b32abe1b9aa4b7983eb629804f97486acbd..be9195b33323ee5597fc212f0befa016eea4573c 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -1037,7 +1037,7 @@  round_push (rtx size)
      TRUNC_DIV_EXPR.  */
   size = expand_binop (Pmode, add_optab, size, alignm1_rtx,
 		       NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, size, align_rtx,
+  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size, align_rtx,
 			NULL_RTX, 1);
   size = expand_mult (Pmode, size, align_rtx, NULL_RTX, 1);
 
@@ -1203,7 +1203,7 @@  align_dynamic_address (rtx target, unsigned required_align)
 			 gen_int_mode (required_align / BITS_PER_UNIT - 1,
 				       Pmode),
 			 NULL_RTX, 1, OPTAB_LIB_WIDEN);
-  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, NULL, NULL, target,
+  target = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, target,
 			  gen_int_mode (required_align / BITS_PER_UNIT,
 					Pmode),
 			  NULL_RTX, 1);
diff --git a/gcc/expmed.h b/gcc/expmed.h
index 0419e2dac85850889ce0bee59515e31a80c582de..4dfe635c22ee49f2dba4c53640941628068f3901 100644
--- a/gcc/expmed.h
+++ b/gcc/expmed.h
@@ -710,9 +710,8 @@  extern rtx expand_shift (enum tree_code, machine_mode, rtx, poly_int64, rtx,
 extern rtx maybe_expand_shift (enum tree_code, machine_mode, rtx, int, rtx,
 			       int);
 #ifdef GCC_OPTABS_H
-extern rtx expand_divmod (int, enum tree_code, machine_mode, tree, tree,
-			  rtx, rtx, rtx, int,
-			  enum optab_methods = OPTAB_LIB_WIDEN);
+extern rtx expand_divmod (int, enum tree_code, machine_mode, rtx, rtx,
+			  rtx, int, enum optab_methods = OPTAB_LIB_WIDEN);
 #endif
 #endif
 
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 917360199ca56157cf3c3693b65e93cd9d8ed244..1553ea8e31eb6433025ab18a3a59c169d3b7692f 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -4222,8 +4222,8 @@  expand_sdiv_pow2 (scalar_int_mode mode, rtx op0, HOST_WIDE_INT d)
 
 rtx
 expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
-	       tree treeop0, tree treeop1, rtx op0, rtx op1, rtx target,
-	       int unsignedp, enum optab_methods methods)
+	       rtx op0, rtx op1, rtx target, int unsignedp,
+	       enum optab_methods methods)
 {
   machine_mode compute_mode;
   rtx tquotient;
@@ -4375,17 +4375,6 @@  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 
   last_div_const = ! rem_flag && op1_is_constant ? INTVAL (op1) : 0;
 
-  /* Check if the target has specific expansions for the division.  */
-  tree cst;
-  if (treeop0
-      && treeop1
-      && (cst = uniform_integer_cst_p (treeop1))
-      && targetm.vectorize.can_special_div_by_const (code, TREE_TYPE (treeop0),
-						     wi::to_wide (cst),
-						     &target, op0, op1))
-    return target;
-
-
   /* Now convert to the best mode to use.  */
   if (compute_mode != mode)
     {
@@ -4629,8 +4618,8 @@  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 			    || (optab_handler (sdivmod_optab, int_mode)
 				!= CODE_FOR_nothing)))
 		      quotient = expand_divmod (0, TRUNC_DIV_EXPR,
-						int_mode, treeop0, treeop1,
-						op0, gen_int_mode (abs_d,
+						int_mode, op0,
+						gen_int_mode (abs_d,
 							      int_mode),
 						NULL_RTX, 0);
 		    else
@@ -4819,8 +4808,8 @@  expand_divmod (int rem_flag, enum tree_code code, machine_mode mode,
 				      size - 1, NULL_RTX, 0);
 		t3 = force_operand (gen_rtx_MINUS (int_mode, t1, nsign),
 				    NULL_RTX);
-		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, treeop0,
-				    treeop1, t3, op1, NULL_RTX, 0);
+		t4 = expand_divmod (0, TRUNC_DIV_EXPR, int_mode, t3, op1,
+				    NULL_RTX, 0);
 		if (t4)
 		  {
 		    rtx t5;
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 15be1c8db999103bb9e5fa33daa44ae06de5ace8..78d35297e755216339078d5b2280c6e277f26d72 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -8207,17 +8207,16 @@  force_operand (rtx value, rtx target)
 	    return expand_divmod (0,
 				  FLOAT_MODE_P (GET_MODE (value))
 				  ? RDIV_EXPR : TRUNC_DIV_EXPR,
-				  GET_MODE (value), NULL, NULL, op1, op2,
-				  target, 0);
+				  GET_MODE (value), op1, op2, target, 0);
 	case MOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 0);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 0);
 	case UDIV:
-	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (0, TRUNC_DIV_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case UMOD:
-	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), NULL, NULL,
-				op1, op2, target, 1);
+	  return expand_divmod (1, TRUNC_MOD_EXPR, GET_MODE (value), op1, op2,
+				target, 1);
 	case ASHIFTRT:
 	  return expand_simple_binop (GET_MODE (value), code, op1, op2,
 				      target, 0, OPTAB_LIB_WIDEN);
@@ -9170,13 +9169,11 @@  expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       bool speed_p = optimize_insn_for_speed_p ();
       do_pending_stack_adjust ();
       start_sequence ();
-      rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 1);
+      rtx uns_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 1);
       rtx_insn *uns_insns = get_insns ();
       end_sequence ();
       start_sequence ();
-      rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 0);
+      rtx sgn_ret = expand_divmod (mod_p, code, mode, op0, op1, target, 0);
       rtx_insn *sgn_insns = get_insns ();
       end_sequence ();
       unsigned uns_cost = seq_cost (uns_insns, speed_p);
@@ -9198,8 +9195,7 @@  expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       emit_insn (sgn_insns);
       return sgn_ret;
     }
-  return expand_divmod (mod_p, code, mode, treeop0, treeop1,
-			op0, op1, target, unsignedp);
+  return expand_divmod (mod_p, code, mode, op0, op1, target, unsignedp);
 }
 
 rtx
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 22b4a2d92967076c658965afcaeaf39b449a8caf..2796d3669a0806538052584f5a3b8a734baa800f 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -174,6 +174,8 @@  DEF_INTERNAL_SIGNED_OPTAB_FN (AVG_CEIL, ECF_CONST | ECF_NOTHROW, first,
 
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULH, ECF_CONST | ECF_NOTHROW, first,
 			      smul_highpart, umul_highpart, binary)
+DEF_INTERNAL_SIGNED_OPTAB_FN (ADDH, ECF_CONST | ECF_NOTHROW, first,
+			      sadd_highpart, uadd_highpart, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHS, ECF_CONST | ECF_NOTHROW, first,
 			      smulhs, umulhs, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index cf22bfec3f5513f56d22c866231edbf322ff6945..474ccbd7915b4f144cebe0369a6e77082c1e617b 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -1106,9 +1106,8 @@  expand_doubleword_mod (machine_mode mode, rtx op0, rtx op1, bool unsignedp)
 		return NULL_RTX;
 	    }
 	}
-      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, NULL, NULL,
-				     sum, gen_int_mode (INTVAL (op1),
-							word_mode),
+      rtx remainder = expand_divmod (1, TRUNC_MOD_EXPR, word_mode, sum,
+				     gen_int_mode (INTVAL (op1), word_mode),
 				     NULL_RTX, 1, OPTAB_DIRECT);
       if (remainder == NULL_RTX)
 	return NULL_RTX;
@@ -1211,8 +1210,8 @@  expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
 
   if (op11 != const1_rtx)
     {
-      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, NULL, NULL, quot1,
-				op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx rem2 = expand_divmod (1, TRUNC_MOD_EXPR, mode, quot1, op11,
+				NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
@@ -1226,8 +1225,8 @@  expand_doubleword_divmod (machine_mode mode, rtx op0, rtx op1, rtx *rem,
       if (rem2 == NULL_RTX)
 	return NULL_RTX;
 
-      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, NULL, NULL, quot1,
-				 op11, NULL_RTX, unsignedp, OPTAB_DIRECT);
+      rtx quot2 = expand_divmod (0, TRUNC_DIV_EXPR, mode, quot1, op11,
+				 NULL_RTX, unsignedp, OPTAB_DIRECT);
       if (quot2 == NULL_RTX)
 	return NULL_RTX;
 
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 695f5911b300c9ca5737de9be809fa01aabe5e01..77a152ec2d1949deca2c2d7a5ccbf6147947351a 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -265,6 +265,8 @@  OPTAB_D (spaceship_optab, "spaceship$a3")
 
 OPTAB_D (smul_highpart_optab, "smul$a3_highpart")
 OPTAB_D (umul_highpart_optab, "umul$a3_highpart")
+OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
+OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
 
 OPTAB_D (cmpmem_optab, "cmpmem$a")
 OPTAB_D (cmpstr_optab, "cmpstr$a")
diff --git a/gcc/target.def b/gcc/target.def
index db8af0cbe81624513f114fc9bbd8be61d855f409..e0a5c7adbd962f5d08ed08d1d81afa2c2baa64a5 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1905,25 +1905,6 @@  implementation approaches itself.",
 	const vec_perm_indices &sel),
  NULL)
 
-DEFHOOK
-(can_special_div_by_const,
- "This hook is used to test whether the target has a special method of\n\
-division of vectors of type @var{vectype} using the value @var{constant},\n\
-and producing a vector of type @var{vectype}.  The division\n\
-will then not be decomposed by the vectorizer and kept as a div.\n\
-\n\
-When the hook is being used to test whether the target supports a special\n\
-divide, @var{in0}, @var{in1}, and @var{output} are all null.  When the hook\n\
-is being used to emit a division, @var{in0} and @var{in1} are the source\n\
-vectors of type @var{vecttype} and @var{output} is the destination vector of\n\
-type @var{vectype}.\n\
-\n\
-Return true if the operation is possible, emitting instructions for it\n\
-if rtxes are provided and updating @var{output}.",
- bool, (enum tree_code, tree vectype, wide_int constant, rtx *output,
-	rtx in0, rtx in1),
- default_can_special_div_by_const)
-
 /* Return true if the target supports misaligned store/load of a
    specific factor denoted in the third parameter.  The last parameter
    is true if the access is defined in a packed struct.  */
diff --git a/gcc/target.h b/gcc/target.h
index 03fd03a52075b4836159035ec14078c0aebdd7e9..93691882757232c514fca82b99f913158c2d47b1 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -51,7 +51,6 @@ 
 #include "insn-codes.h"
 #include "tm.h"
 #include "hard-reg-set.h"
-#include "tree-core.h"
 
 #if CHECKING_P
 
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index a1df260f5483dc84f18d8f12c5202484a32d5bb7..a6a4809ca91baa5d7fad2244549317a31390f0c2 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -209,8 +209,6 @@  extern void default_addr_space_diagnose_usage (addr_space_t, location_t);
 extern rtx default_addr_space_convert (rtx, tree, tree);
 extern unsigned int default_case_values_threshold (void);
 extern bool default_have_conditional_execution (void);
-extern bool default_can_special_div_by_const (enum tree_code, tree, wide_int,
-					      rtx *, rtx, rtx);
 
 extern bool default_libc_has_function (enum function_class, tree);
 extern bool default_libc_has_fast_function (int fcode);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index fe0116521feaf32187e7bc113bf93b1805852c79..211525720a620d6f533e2da91e03877337a931e7 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1840,14 +1840,6 @@  default_have_conditional_execution (void)
   return HAVE_conditional_execution;
 }
 
-/* Default that no division by constant operations are special.  */
-bool
-default_can_special_div_by_const (enum tree_code, tree, wide_int, rtx *, rtx,
-				  rtx)
-{
-  return false;
-}
-
 /* By default we assume that c99 functions are present at the runtime,
    but sincos is not.  */
 bool
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..c81f8946922250234bf759e0a0a04ea8c1f73e3c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-4.c
@@ -0,0 +1,25 @@ 
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include "tree-vect.h"
+
+typedef unsigned __attribute__((__vector_size__ (16))) V;
+
+static __attribute__((__noinline__)) __attribute__((__noclone__)) V
+foo (V v, unsigned short i)
+{
+  v /= i;
+  return v;
+}
+
+int
+main (void)
+{
+  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
+  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
+    if (v[i] != 0x00010001)
+      __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "vect_recog_divmod_pattern: detected" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..b4eb1a4dacba481e6306b49914d2a29b933de625
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-5.c
@@ -0,0 +1,58 @@ 
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 50
+#define TYPE uint8_t 
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+
+__attribute__((noipa, noinline, optimize("O1")))
+void fun1(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+__attribute__((noipa, noinline, optimize("O3")))
+void fun2(TYPE* restrict pixel, TYPE level, int n)
+{
+  for (int i = 0; i < n; i+=1)
+    pixel[i] = (pixel[i] + level) / 0xff;
+}
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N / 2, N);
+  fun2 (b, N / 2, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "divmod pattern recognized" "vect" { target aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 166a248f4b9512d4c6fc8d760b458b7a467f7790..519a824ec727d4d4f28c14077dc3e970bed75ef6 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1237,17 +1237,6 @@  expand_vector_operation (gimple_stmt_iterator *gsi, tree type, tree compute_type
 	  tree rhs2 = gimple_assign_rhs2 (assign);
 	  tree ret;
 
-	  /* Check if the target was going to handle it through the special
-	     division callback hook.  */
-	  tree cst = uniform_integer_cst_p (rhs2);
-	  if (cst &&
-	      targetm.vectorize.can_special_div_by_const (code, type,
-							  wi::to_wide (cst),
-							  NULL,
-							  NULL_RTX, NULL_RTX))
-	    return NULL_TREE;
-
-
 	  if (!optimize
 	      || !VECTOR_INTEGER_TYPE_P (type)
 	      || TREE_CODE (rhs2) != VECTOR_CST
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3914,12 +3914,82 @@  vect_recog_divmod_pattern (vec_info *vinfo,
       return pattern_stmt;
     }
   else if ((cst = uniform_integer_cst_p (oprnd1))
-	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
-							  wi::to_wide (cst),
-							  NULL, NULL_RTX,
-							  NULL_RTX))
+	   && TYPE_UNSIGNED (itype)
+	   && rhs_code == TRUNC_DIV_EXPR
+	   && vectype
+	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
+					      OPTIMIZE_FOR_SPEED))
     {
-      return NULL;
+      /* Division optimizations using narrowings.
+       We can divide e.g. a short by 255 faster by calculating the result
+       as (x + ((x + 257) >> 8)) >> 8, assuming the operation is done in
+       double the precision of x.
+
+       If we imagine a short as being composed of two blocks of bytes then
+       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
+       adding 1 to each sub-component:
+
+	    short value of 16-bits
+       ┌──────────────┬────────────────┐
+       │              │                │
+       └──────────────┴────────────────┘
+	 8-bit part1 ▲  8-bit part2   ▲
+		     │                │
+		     │                │
+		    +1               +1
+
+       After the first addition we have to shift right by 8 and narrow the
+       result back to a byte.  Remember that the addition must be done in
+       double the precision of the input.  However, if we know that the
+       addition `x + 257` does not overflow then we can do the operation in
+       the current precision and so avoid the packs and unpacks.  */
+      auto wcst = wi::to_wide (cst);
+      int pow = wi::exact_log2 (wcst + 1);
+      if (pow == (int) (element_precision (vectype) / 2))
+	{
+	  wide_int min, max;
+	  /* If we're in a pattern we need to find the original definition.  */
+	  tree op0 = oprnd0;
+	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
+	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
+	  if (is_pattern_stmt_p (stmt_info))
+	    {
+	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
+		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
+	    }
+
+	  /* Check that no overflow will occur.  If we don't have range
+	     information we can't perform the optimization.  */
+	  if (vect_get_range_info (op0, &min, &max))
+	    {
+	      wide_int one = wi::to_wide (build_one_cst (itype));
+	      wide_int adder = wi::add (one, wi::lshift (one, pow));
+	      wi::overflow_type ovf;
+	      /* We need adder and max in the same precision.  */
+	      wide_int zadder
+		= wide_int_storage::from (adder, wi::get_precision (max),
+					  UNSIGNED);
+	      wi::add (max, zadder, UNSIGNED, &ovf);
+	      if (ovf == wi::OVF_NONE)
+		{
+		  *type_out = vectype;
+		  tree tadder = wide_int_to_tree (itype, adder);
+		  gcall *patt1
+		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, tadder);
+		  tree lhs = vect_recog_temp_ssa_var (itype, NULL);
+		  gimple_call_set_lhs (patt1, lhs);
+		  append_pattern_def_seq (vinfo, stmt_vinfo, patt1, vectype);
+
+		  pattern_stmt
+		    = gimple_build_call_internal (IFN_ADDH, 2, oprnd0, lhs);
+		  lhs = vect_recog_temp_ssa_var (itype, NULL);
+		  gimple_call_set_lhs (pattern_stmt, lhs);
+
+		  return pattern_stmt;
+		}
+	    }
+	}
     }
 
   if (prec > HOST_BITS_PER_WIDE_INT
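
To make the transform above easy to sanity-check outside the compiler, here
is a minimal standalone sketch of the identity the new pattern emits.  It is
an illustration only, not part of the patch: addh16 is a hypothetical scalar
stand-in for the new uadd<m>3_highpart / IFN_ADDH operation on 16-bit
elements, valid only when the addition does not wrap, which is exactly the
no-overflow condition the range check above guarantees.

#include <assert.h>
#include <stdint.h>

/* Scalar model of unsigned add-highpart on 16-bit values: the top half
   of x + y.  Only meaningful when x + y fits in 16 bits, which is the
   condition the pattern proves via range information.  */
static uint16_t
addh16 (uint16_t x, uint16_t y)
{
  return (uint16_t) (x + y) >> 8;
}

int
main (void)
{
  /* adder = 1 + (1 << pow) = 257 for pow = 8, as built in the pattern.  */
  for (uint32_t x = 0; x + 257 <= 0xffff; x++)
    {
      /* ADDH (x, ADDH (x, 257)) == (x + ((x + 257) >> 8)) >> 8.  */
      uint16_t q = addh16 (x, addh16 (x, 257));
      assert (q == x / 255);
    }
  return 0;
}

The two addh16 calls correspond one-to-one to the two IFN_ADDH pattern
statements built above (patt1 and pattern_stmt).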
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index eb4ca1f184e374d177eb43d5eb93acf6e6a8fde9..3a0fb5ad898ad42c3867f0b9564fc4e066e50081 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6263,15 +6263,6 @@  vectorizable_operation (vec_info *vinfo,
 	}
       target_support_p = (optab_handler (optab, vec_mode)
 			  != CODE_FOR_nothing);
-      tree cst;
-      if (!target_support_p
-	  && op1
-	  && (cst = uniform_integer_cst_p (op1)))
-	target_support_p
-	  = targetm.vectorize.can_special_div_by_const (code, vectype,
-							wi::to_wide (cst),
-							NULL, NULL_RTX,
-							NULL_RTX);
     }
 
   bool using_emulated_vectors_p = vect_emulated_vector_p (vectype);