[to-be-committed,RISC-V] Slightly improve broadcasting small constants into vectors

Message ID 25c5da43-6da0-41bb-a3e8-9c50f73ad90f@ventanamicro.com
State New

Commit Message

Jeff Law Oct. 11, 2024, 1:25 p.m. UTC
I probably spent way more time on this than it's worth...

I was looking at the code we generate for vector SAD and noticed that we 
were being a bit silly.  Specifically:

         li      a4,0            # 272   [c=4 l=4]  *movsi_internal/1

Followed shortly by:

         vmv.s.x v3,a4   # 261   [c=4 l=4]  *pred_broadcastrvvm1si/6

And no other uses of a4.  We could have used x0 trivially.

First we adjust the expander so that it doesn't force the constant into 
a register.  In the matching pattern we change the appropriate source 
constraints from "r" to "rJ" and the output template is changed to use 
%z for the operand.  The net result is that we drop the li completely
and emit vmv.s.x v3,x0.

But wait, there's more.  If we're broadcasting a constant in the range 
[-16..15] into a vector, we currently load the constant into a register 
and use vmv.v.x.  We can instead use vmv.v.i, which avoids loading the 
constant into a GPR.  For that case we again avoid forcing the constant 
into a register in the expander and adjust the output template to emit 
vmv.v.x or vmv.v.i depending on whether the relevant operand is a 
general purpose register or a constant.  So again, we drop a 
load-immediate into a scalar register for this case.
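
The two cases above can be modelled with a short stand-alone C sketch.
It is not GCC code: struct scalar_operand, fits_simm5, format_splat and
format_elt0 are hypothetical names made up for illustration, and the
sketch only mimics the choice the adjusted constraints and output
templates are described as making.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the scalar source operand: either a GPR name
   or an integer constant.  This is not a GCC type.  */
struct scalar_operand
{
  bool is_const;
  long value;       /* Valid when is_const is true.  */
  const char *reg;  /* Valid when is_const is false.  */
};

/* Same range test as the new "P" constraint: a 5-bit signed
   immediate, i.e. a value in [-16, 15].  */
static bool
fits_simm5 (long value)
{
  return value >= -16 && value <= 15;
}

/* Broadcast to every element.  Returns false when the constant does
   not fit in 5 signed bits and would still need a GPR.  */
static bool
format_splat (char *buf, size_t len, const char *vd,
              const struct scalar_operand *src)
{
  if (!src->is_const)
    snprintf (buf, len, "vmv.v.x %s,%s", vd, src->reg);
  else if (fits_simm5 (src->value))
    snprintf (buf, len, "vmv.v.i %s,%ld", vd, src->value);
  else
    return false;
  return true;
}

/* Write element 0 only.  Constant zero comes straight from x0; any
   other constant still needs a GPR, so false is returned.  */
static bool
format_elt0 (char *buf, size_t len, const char *vd,
             const struct scalar_operand *src)
{
  if (!src->is_const)
    snprintf (buf, len, "vmv.s.x %s,%s", vd, src->reg);
  else if (src->value == 0)
    snprintf (buf, len, "vmv.s.x %s,x0", vd);
  else
    return false;
  return true;
}
```

With a constant-zero operand, format_elt0 yields "vmv.s.x v3,x0", and
with the constant 5, format_splat yields "vmv.v.i v3,5"; in both cases
no li is needed.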

Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15] 
into the 0th element is probably uarch dependent.  The tradeoff is 
loading the GPR vs the broadcast in the vector unit.  I didn't bother 
with this case.

Tested in my tester (which tests rv64gcv as a default codegen option). 
Will wait for the pre-commit tester to render a verdict.

Jeff

	* config/riscv/constraints.md (P): New constraint for constant
	integers -16..15.
	* config/riscv/vector.md (pred_broadcast<mode> expander): Do not force
	constants into registers quite so aggressively.
	(pred_broadcast<mode> insn & splitter): Adjust constraints to allow
	constants in a few cases and adjust output appropriately.

Comments

Andrew Waterman Oct. 11, 2024, 11:40 p.m. UTC | #1
On Fri, Oct 11, 2024 at 6:26 AM Jeff Law <jlaw@ventanamicro.com> wrote:
>
> I probably spent way more time on this than it's worth...
>
> I was looking at the code we generate for vector SAD and noticed that we
> were being a bit silly.  Specifically:
>
>          li      a4,0            # 272   [c=4 l=4]  *movsi_internal/1
>
> Followed shortly by:
>
>          vmv.s.x v3,a4   # 261   [c=4 l=4]  *pred_broadcastrvvm1si/6
>
> And no other uses of a4.  We could have used x0 trivially.
>
> First we adjust the expander so that it doesn't force the constant into
> a register.  In the matching pattern we change the appropriate source
> constraints from "r" to "rJ" and the output template is changed to use
> %z for the operand.  The net is we drop the li completely and emit
> vmv.s.x v3,x0.
>
> But wait, there's more.  If we're broadcasting a constant in the range
> [-16..15] into a vector, we currently load the constant into a register
> and use vmv.v.x.  We can instead use vmv.v.i, which avoids loading the
> constant into a GPR.  For that case we again avoid forcing the constant
> into a register in the expander and adjust the output template to emit
> vmv.v.x or vmv.v.i based on whether or not the appropriate operand is a
> constant or general purpose register.  So again, we'll drop a load
> immediate into a scalar for this case.
>
> Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
> into the 0th element is probably uarch dependent.  The tradeoff is
> loading the GPR vs the broadcast in the vector unit.  I didn't bother
> with this case.

Note that this tradeoff is only interesting when LMUL is small.  When
LMUL is large, vmv.v.i does a lot more work than vmv.s.x (writing
multiple vector registers versus just one).
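
As a rough illustration of that point, here is a minimal C sketch. It
assumes VL equals VLMAX, so a splat touches every register in the
group; the function name and the fraction encoding of LMUL are
hypothetical, not part of any real API.

```c
/* Upper bound on the vector registers written by a full splat such as
   vmv.v.i, with LMUL given as the fraction lmul_num / lmul_den
   (for example 1/2, 1/1 or 8/1).  Assumes VL == VLMAX.  A fractional
   LMUL still occupies one register.  vmv.s.x, by contrast, writes
   only element 0 of a single register.  */
static unsigned
regs_written_by_splat (unsigned lmul_num, unsigned lmul_den)
{
  return lmul_num > lmul_den ? lmul_num / lmul_den : 1;
}
```

Under that assumption, at LMUL=8 the splat writes eight registers
where vmv.s.x writes into one; at LMUL <= 1 both write into one.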

>
> Tested in my tester (which tests rv64gcv as a default codegen option).
> Will wait for the pre-commit tester to render a verdict.
>
> Jeff

Jeff Law Oct. 12, 2024, 12:21 a.m. UTC | #2
On 10/11/24 5:40 PM, Andrew Waterman wrote:
>> Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
>> into the 0th element is probably uarch dependent.  The tradeoff is
>> loading the GPR vs the broadcast in the vector unit.  I didn't bother
>> with this case.
> 
> Note that this tradeoff is only interesting when LMUL is small.  When
> LMUL is large, vmv.v.i does a lot more work than vmv.s.x (writing
> multiple vector registers versus just one).
Very true and I would expect LMUL <= 1 to be the most common case.


Mostly it's a matter of spotting something dumb and fixing it rather 
than having to answer questions later about dumb codegen.  I doubt any 
of these cases matter in practice.


Jeff

Patch

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 45f8e9602d2..9638942b733 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -70,6 +70,11 @@  (define_constraint "c08"
   (and (match_code "const_int")
        (match_test "ival == 8")))
 
+(define_constraint "P"
+  "A 5-bit signed immediate for vmv.v.i."
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (ival, -16, 15)")))
+
 (define_constraint "K"
   "A 5-bit unsigned immediate for CSR access instructions."
   (and (match_code "const_int")
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 7c8780dc7c7..b3038087aa5 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -2118,6 +2118,16 @@  (define_expand "@pred_broadcast<mode>"
       emit_move_insn (tmp, gen_int_mode (value, Pmode));
       operands[3] = gen_rtx_SIGN_EXTEND (<VEL>mode, tmp);
     }
+  /* Never load (const_int 0) into a register, that's silly.  */
+  else if (operands[3] == CONST0_RTX (<VEL>mode))
+    ;
+  /* If we're broadcasting [-16..15] across more than just
+     element 0, then we can use vmv.v.i directly, thus avoiding
+     the load of the constant into a GPR.  */
+  else if (CONST_INT_P (operands[3])
+	   && IN_RANGE (INTVAL (operands[3]), -16, 15)
+	   && !satisfies_constraint_Wb1 (operands[1]))
+    ;
   else
     operands[3] = force_reg (<VEL>mode, operands[3]);
 })
@@ -2134,18 +2144,18 @@  (define_insn_and_split "*pred_broadcast<mode>"
 	     (reg:SI VL_REGNUM)
 	     (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
 	  (vec_duplicate:V_VLSI
-	    (match_operand:<VEL> 3 "direct_broadcast_operand"       " r,  r,Wdm,Wdm,Wdm,Wdm,  r,  r"))
-	  (match_operand:V_VLSI 2 "vector_merge_operand"            "vu,  0, vu,  0, vu,  0, vu,  0")))]
+	    (match_operand:<VEL> 3 "direct_broadcast_operand"       "rP,rP,Wdm,Wdm,Wdm,Wdm, rJ, rJ"))
+	  (match_operand:V_VLSI 2 "vector_merge_operand"            "vu, 0, vu,  0, vu,  0, vu,  0")))]
   "TARGET_VECTOR"
   "@
-   vmv.v.x\t%0,%3
-   vmv.v.x\t%0,%3
+   vmv.v.%o3\t%0,%3
+   vmv.v.%o3\t%0,%3
    vlse<sew>.v\t%0,%3,zero,%1.t
    vlse<sew>.v\t%0,%3,zero,%1.t
    vlse<sew>.v\t%0,%3,zero
    vlse<sew>.v\t%0,%3,zero
-   vmv.s.x\t%0,%3
-   vmv.s.x\t%0,%3"
+   vmv.s.x\t%0,%z3
+   vmv.s.x\t%0,%z3"
   "(register_operand (operands[3], <VEL>mode)
   || CONST_POLY_INT_P (operands[3]))
   && GET_MODE_BITSIZE (<VEL>mode) > GET_MODE_BITSIZE (Pmode)"