Message ID | 25c5da43-6da0-41bb-a3e8-9c50f73ad90f@ventanamicro.com |
---|---|
State | New |
Series | [to-be-committed,RISC-V] Slightly improve broadcasting small constants into vectors |
On Fri, Oct 11, 2024 at 6:26 AM Jeff Law <jlaw@ventanamicro.com> wrote:
>
> I probably spent way more time on this than it's worth...
>
> I was looking at the code we generate for vector SAD and noticed that we
> were being a bit silly. Specifically:
>
> li a4,0 # 272 [c=4 l=4] *movsi_internal/1
>
> Followed shortly by:
>
> vmv.s.x v3,a4 # 261 [c=4 l=4] *pred_broadcastrvvm1si/6
>
> And no other uses of a4. We could have used x0 trivially.
>
> First we adjust the expander so that it doesn't force the constant into
> a register. In the matching pattern we change the appropriate source
> constraints from "r" to "rJ" and the output template is changed to use
> %z for the operand. The net is we drop the li completely and emit
> vmv.s.x v3,x0.
>
> But wait, there's more. If we're broadcasting a constant in the range
> [-16..15] into a vector, we currently load the constant into a register
> and use vmv.v.x. We can instead use vmv.v.i, which avoids loading the
> constant into a GPR. For that case we again avoid forcing the constant
> into a register in the expander and adjust the output template to emit
> vmv.v.x or vmv.v.i based on whether the appropriate operand is a
> general purpose register or a constant. So again, we'll drop a load
> immediate into a scalar for this case.
>
> Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
> into the 0th element is probably uarch dependent. The tradeoff is
> loading the GPR vs the broadcast in the vector unit. I didn't bother
> with this case.

Note that this tradeoff is only interesting when LMUL is small. When LMUL is large, vmv.v.i does a lot more work than vmv.s.x (writing multiple vector registers versus just one).

>
> Tested in my tester (which tests rv64gcv as a default codegen option).
> Will wait for the pre-commit tester to render a verdict.
>
> Jeff
On 10/11/24 5:40 PM, Andrew Waterman wrote:
>> Whether or not we should use vmv.v.i vs vmv.s.x for loading [-16..15]
>> into the 0th element is probably uarch dependent. The tradeoff is
>> loading the GPR vs the broadcast in the vector unit. I didn't bother
>> with this case.
>
> Note that this tradeoff is only interesting when LMUL is small. When
> LMUL is large, vmv.v.i does a lot more work than vmv.s.x (writing
> multiple vector registers versus just one).

Very true and I would expect LMUL <= 1 to be the most common case.
Mostly it's a matter of spotting something dumb and fixing it rather
than having to answer questions later about dumb codegen. I doubt any
of these cases matter in practice.

Jeff
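The decision described in the thread can be sketched in plain C. This is only an illustrative model, not GCC code: the enum, `fits_simm5` and `classify_broadcast_constant` are hypothetical names, and `element0_only` stands in for the case where only element 0 is written (the vmv.s.x alternatives).

```c
#include <stdbool.h>

/* Hypothetical model of where a broadcast constant ends up after the
   patch; none of these names exist in GCC.  */
enum broadcast_src
{
  SRC_X0,    /* use x0 directly, no li needed */
  SRC_IMM5,  /* encode the constant in vmv.v.i */
  SRC_GPR    /* constant must first be loaded into a GPR */
};

/* True if VALUE fits the 5-bit signed immediate range [-16..15].  */
static bool
fits_simm5 (long value)
{
  return value >= -16 && value <= 15;
}

/* Classify a constant broadcast.  ELEMENT0_ONLY models a move into
   element 0 only (vmv.s.x); otherwise every element is written.  */
static enum broadcast_src
classify_broadcast_constant (long value, bool element0_only)
{
  if (element0_only)
    /* vmv.s.x has no immediate form, so only zero avoids the GPR.  */
    return value == 0 ? SRC_X0 : SRC_GPR;

  if (fits_simm5 (value))
    return SRC_IMM5;

  return SRC_GPR;
}
```

Under this model, a nonzero small constant moved into element 0 still goes through a GPR, which matches the case the patch deliberately leaves alone.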
diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 45f8e9602d2..9638942b733 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -70,6 +70,11 @@ (define_constraint "c08"
   (and (match_code "const_int")
        (match_test "ival == 8")))
 
+(define_constraint "P"
+  "A 5-bit signed immediate for vmv.v.i."
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (ival, -16, 15)")))
+
 (define_constraint "K"
   "A 5-bit unsigned immediate for CSR access instructions."
   (and (match_code "const_int")
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 7c8780dc7c7..b3038087aa5 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -2118,6 +2118,16 @@ (define_expand "@pred_broadcast<mode>"
           emit_move_insn (tmp, gen_int_mode (value, Pmode));
           operands[3] = gen_rtx_SIGN_EXTEND (<VEL>mode, tmp);
         }
+      /* Never load (const_int 0) into a register, that's silly. */
+      else if (operands[3] == CONST0_RTX (<VEL>mode))
+        ;
+      /* If we're broadcasting [-16..15] across more than just
+         element 0, then we can use vmv.v.i directly, thus avoiding
+         the load of the constant into a GPR. */
+      else if (CONST_INT_P (operands[3])
+               && IN_RANGE (INTVAL (operands[3]), -16, 15)
+               && !satisfies_constraint_Wb1 (operands[1]))
+        ;
       else
         operands[3] = force_reg (<VEL>mode, operands[3]);
     })
@@ -2134,18 +2144,18 @@ (define_insn_and_split "*pred_broadcast<mode>"
              (reg:SI VL_REGNUM)
              (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
           (vec_duplicate:V_VLSI
-            (match_operand:<VEL> 3 "direct_broadcast_operand" " r, r,Wdm,Wdm,Wdm,Wdm, r, r"))
-          (match_operand:V_VLSI 2 "vector_merge_operand" "vu, 0, vu, 0, vu, 0, vu, 0")))]
+            (match_operand:<VEL> 3 "direct_broadcast_operand" "rP,rP,Wdm,Wdm,Wdm,Wdm, rJ, rJ"))
+          (match_operand:V_VLSI 2 "vector_merge_operand" "vu, 0, vu, 0, vu, 0, vu, 0")))]
   "TARGET_VECTOR"
   "@
-   vmv.v.x\t%0,%3
-   vmv.v.x\t%0,%3
+   vmv.v.%o3\t%0,%3
+   vmv.v.%o3\t%0,%3
    vlse<sew>.v\t%0,%3,zero,%1.t
    vlse<sew>.v\t%0,%3,zero,%1.t
    vlse<sew>.v\t%0,%3,zero
    vlse<sew>.v\t%0,%3,zero
-   vmv.s.x\t%0,%3
-   vmv.s.x\t%0,%3"
+   vmv.s.x\t%0,%z3
+   vmv.s.x\t%0,%z3"
   "(register_operand (operands[3], <VEL>mode)
   || CONST_POLY_INT_P (operands[3]))
   && GET_MODE_BITSIZE (<VEL>mode) > GET_MODE_BITSIZE (Pmode)"
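To make the effect of the two template changes concrete, here is a rough C model of what the new output templates would print for a constant versus a register operand. It is a sketch under assumptions: `struct scalar_operand` and both functions are hypothetical stand-ins, not GCC's operand printer, and the zero register is spelled `zero` here (the ABI name for x0).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for operand 3: either a constant or a named
   general purpose register.  Not a GCC type.  */
struct scalar_operand
{
  bool is_const;
  long value;       /* meaningful when is_const is true */
  const char *reg;  /* meaningful when is_const is false, e.g. "a4" */
};

/* Models "vmv.v.%o3\t%0,%3": the mnemonic suffix follows the operand
   kind, 'i' for an immediate and 'x' for a GPR.  */
static int
print_full_broadcast (char *buf, size_t len, const char *vd,
                      struct scalar_operand op)
{
  if (op.is_const)
    return snprintf (buf, len, "vmv.v.i\t%s,%ld", vd, op.value);
  return snprintf (buf, len, "vmv.v.x\t%s,%s", vd, op.reg);
}

/* Models "vmv.s.x\t%0,%z3": constant zero prints as the zero
   register, anything else must already be in a GPR.  Returns -1 for
   a nonzero constant, which this form cannot encode.  */
static int
print_element0_move (char *buf, size_t len, const char *vd,
                     struct scalar_operand op)
{
  if (op.is_const && op.value != 0)
    return -1;
  const char *src = op.is_const ? "zero" : op.reg;
  return snprintf (buf, len, "vmv.s.x\t%s,%s", vd, src);
}
```

With a constant 0 operand and destination `v3`, the second function produces the `vmv.s.x` form with the zero register, so the separate `li` from the original codegen is no longer needed.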