Message ID | a4d5b600-2e21-3fea-17f8-4c2764880409@gmail.com
---|---
State | New
Series | [RFA] Improve strcmp expansion when one input is a constant string.
On Sun, Jun 4, 2023 at 11:41 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> While investigating a RISC-V backend patch from Jivan I noticed a
> regression in terms of dynamic instruction counts for the omnetpp
> benchmark in spec2017.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620577.html
>
> The code we get with Jivan's patch at expansion time looks like this
> for each character in the input string:
>
> (insn 6 5 7 (set (reg:SI 137)
>         (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM
> <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
>      (nil))
>
> (insn 7 6 8 (set (reg:DI 138)
>         (sign_extend:DI (plus:SI (reg:SI 137)
>             (const_int -108 [0xffffffffffffff94])))) "j.c":5:11 -1
>      (nil))
>
> (insn 8 7 9 (set (reg:SI 136)
>         (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 -1
>      (expr_list:REG_EQUAL (plus:SI (reg:SI 137)
>             (const_int -108 [0xffffffffffffff94]))
>         (nil)))
>
> (insn 9 8 10 (set (reg:DI 139)
>         (sign_extend:DI (reg:SI 136))) "j.c":5:11 -1
>      (nil))
>
> (jump_insn 10 9 11 (set (pc)
>         (if_then_else (ne (reg:DI 139)
>                 (const_int 0 [0]))
>             (label_ref 64)
>             (pc))) "j.c":5:11 -1
>      (nil))
>
> Ignore insn 9.  fwprop will turn it into a trivial copy from
> r138->r139 which will ultimately propagate away.
>
> All the paths eventually transfer control to the label in question,
> either by jumping or falling through on the last character.
> After a bit of cleanup by fwprop & friends we have:
>
> > (insn 6 3 7 2 (set (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
> >         (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 114 {zero_extendqisi2}
> >      (nil))
> > (insn 7 6 8 2 (set (reg:DI 138)
> >         (sign_extend:DI (plus:SI (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
> >             (const_int -108 [0xffffffffffffff94])))) "j.c":5:11 6 {addsi3_extended}
> >      (expr_list:REG_DEAD (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
> >         (nil)))
> > (insn 8 7 10 2 (set (reg:SI 136 [ MEM <char[1:12]> [(void *)x_2(D)]+11 ])
> >         (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 180 {*movsi_internal}
> >      (nil))
> > (jump_insn 10 8 73 2 (set (pc)
> >         (if_then_else (ne (reg:DI 138)
> >                 (const_int 0 [0]))
> >             (label_ref 64)
> >             (pc))) "j.c":5:11 243 {*branchdi}
> >      (expr_list:REG_DEAD (reg:DI 138)
> >         (int_list:REG_BR_PROB 536870916 (nil)))
> >  -> 64)
>
> insn 8 is the result of wanting the ultimate result of the strcmp to
> be an "int" type (SImode).  Note that (reg 136) is the result of the
> strcmp.  It gets set in each fragment of code that compares one
> element in the string.  It's also live after the strcmp sequence.  As
> a result combine isn't going to be able to clean this up.
>
> Note how (reg 136) births while (reg 138) is live and even though
> (reg 136) is a copy of (reg 138), IRA doesn't have the necessary code
> to determine that the regs do not conflict.  As a result (reg 136)
> and (reg 138) must be allocated different hard registers and we get
> code like this:
>
> >         lbu     a5,0(a0)        # 6   [c=28 l=4]  zero_extendqisi2/1
> >         addiw   a5,a5,-108      # 7   [c=8 l=4]   addsi3_extended/1
> >         mv      a4,a5           # 8   [c=4 l=4]   *movsi_internal/0
> >         bne     a5,zero,.L2     # 10  [c=4 l=4]   *branchdi
>
> Note the annoying "mv".
>
> Rather than do a conversion for each character, we could do each step
> in word_mode and do the conversion once at the end of the whole
> sequence.
> So for each character we expand to:
>
> > (insn 6 5 7 (set (reg:DI 138)
> >         (zero_extend:DI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
> >      (nil))
> >
> > (insn 7 6 8 (set (reg:DI 137)
> >         (plus:DI (reg:DI 138)
> >             (const_int -108 [0xffffffffffffff94]))) "j.c":5:11 -1
> >      (nil))
> >
> > (jump_insn 8 7 9 (set (pc)
> >         (if_then_else (ne (reg:DI 137)
> >                 (const_int 0 [0]))
> >             (label_ref 41)
> >             (pc))) "j.c":5:11 -1
> >      (nil))
>
> Good.  Then at the end of the sequence we have:
>
> > (code_label 41 40 42 2 (nil) [0 uses])
> >
> > (insn 42 41 43 (set (reg:SI 136)
> >         (subreg:SI (reg:DI 137) 0)) "j.c":5:11 -1
> >      (nil))
>
> Which seems like exactly what we want.  At the assembly level we get:
>
>         lbu     a5,0(a0)        # 6   [c=28 l=4]  zero_extendqidi2/1
>         addi    a0,a5,-108      # 7   [c=4 l=4]   adddi3/1
>         bne     a0,zero,.L2     # 8   [c=4 l=4]   *branchdi
>         [ ... ]
>
> At the end of the sequence we realize the narrowing subreg followed
> by an extension isn't necessary and just remove them.
>
> The ultimate result is omnetpp goes from a small regression to a
> small overall improvement with Jivan's patch.
>
> Bootstrapped and regression tested on x86.  Also built and run
> spec2017 on riscv64.
>
> OK for the trunk?

But then for example x86 has smaller encoding for byte ops and while
widening is easily done later, truncation is not.

Btw, you failed to update the overall function comment which lists
the conversions applied.

Note I would have expected to use the mode of the load so we truly
elide some extensions, using word_mode looks like just another
mode here?  The key to note is probably

  op0 = convert_modes (mode, unit_mode, op0, 1);
  op1 = convert_modes (mode, unit_mode, op1, 1);
  rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
                                  result, 1, OPTAB_WIDEN);

which uses OPTAB_WIDEN - wouldn't it be better to pass in the
unconverted modes and leave the decision which mode to use
to OPTAB_WIDEN?
Should we somehow query the target for the smallest mode from
unit_mode it can do both the MINUS and the compare?

Thanks,
Richard.

> Jeff
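As a side note for readers, Jeff's per-character expansion can be modeled at the C level. The sketch below is illustrative only (the function name and shape are mine, not GCC's actual `inline_string_cmp`): each step widens a byte load, subtracts the known constant character (the `-108` in the RTL above is `-'l'`), and branches out of the sequence on the first nonzero difference. The terminating NUL is compared too, so equal strings fall through with zero.

```c
#include <assert.h>
#include <string.h>

/* Illustrative model of expanding strcmp (x, cst) when cst is a
   constant string: one load/subtract/branch fragment per character,
   including the trailing NUL.  Not GCC source; just the shape of the
   emitted sequence.  */
static int
inline_strcmp_model (const unsigned char *x, const char *cst)
{
  size_t len = strlen (cst) + 1;     /* compare the NUL as well */
  int result = 0;
  for (size_t i = 0; i < len; i++)
    {
      /* (zero_extend ...) then (plus ... (const_int -c)) */
      result = (int) x[i] - (int) (unsigned char) cst[i];
      if (result != 0)               /* bne result,zero,.L2 */
        break;
    }
  return result;
}
```

The `mv` Jeff complains about corresponds to keeping `result` in SImode at every step even though the subtract on RV64 naturally produces a sign-extended DImode value.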
On 6/5/23 00:29, Richard Biener wrote:

> But then for example x86 has smaller encoding for byte ops and while
> widening is easily done later, truncation is not.

Which probably argues we need to be checking costs.

> Btw, you failed to update the overall function comment which lists
> the conversions applied.

ACK.  It occurred to me when I woke up today that I also failed to
handle the case where word_mode is actually smaller than an int.

> Note I would have expected to use the mode of the load so we truly
> elide some extensions, using word_mode looks like just another
> mode here?  The key to note is probably
>
>   op0 = convert_modes (mode, unit_mode, op0, 1);
>   op1 = convert_modes (mode, unit_mode, op1, 1);
>   rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
>                                   result, 1, OPTAB_WIDEN);

On many (most?) targets the loads can cheaply extend to word_mode.

> which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> unconverted modes and leave the decision which mode to use
> to OPTAB_WIDEN?  Should we somehow query the target for
> the smallest mode from unit_mode it can do both the MINUS
> and the compare?

I'll play with it.  My worry is that the optabs are going to leave
things in SImode.  RV64 has a 32-bit add/subtract which implicitly
sign extends the result to 64 bits, and Jivan's patch models that
behavior with generally very good results.  But again, I'll play with
it.

I do agree that we need to be looking at cost modeling more in here.
So I'll poke at that too.

jeff
On 6/5/23 00:29, Richard Biener wrote:

> But then for example x86 has smaller encoding for byte ops and while
> widening is easily done later, truncation is not.

Sadly, the x86 costing looks totally bogus here.  We actually emit the
exact same code for a QImode load vs a zero-extending load from QI to
SI.  But the costing is different and would tend to prefer QImode.
That in turn is going to force an extension at the end of the
sequence, which would be a regression relative to the current code.
Additionally we may get partial register stalls for the byte ops that
implement the comparison steps.

The net result is that querying the backend's costs would do the exact
opposite of what I think we want on x86.  One could argue the x86
maintainers should improve this situation...

> Note I would have expected to use the mode of the load so we truly
> elide some extensions, using word_mode looks like just another
> mode here?  The key to note is probably
>
>   op0 = convert_modes (mode, unit_mode, op0, 1);
>   op1 = convert_modes (mode, unit_mode, op1, 1);
>   rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
>                                   result, 1, OPTAB_WIDEN);
>
> which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> unconverted modes and leave the decision which mode to use
> to OPTAB_WIDEN?  Should we somehow query the target for
> the smallest mode from unit_mode it can do both the MINUS
> and the compare?

And avoiding OPTAB_WIDEN isn't going to help rv64 at all.  The core
issue being that we do define 32-bit ops.  With Jivan's patch those
32-bit ops expose their sign-extending nature.  So a 32-bit add would
look something like

  (set (temp:DI) (sign_extend:DI (plus:SI (op:SI) (op:SI))))
  (set (res:SI) (subreg:SI (temp:DI) 0))

where we mark the subreg with SUBREG_PROMOTED_VAR_P.

I'm not sure the best way to proceed now.  I could just put this on
the back-burner as it's RISC-V specific and the gains elsewhere dwarf
this issue.

jeff
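For readers unfamiliar with the RV64 behavior Jeff describes, the semantics of `addw` can be mimicked in C. This is a sketch under stated assumptions (two's-complement wraparound for the 32-bit add, and GCC's documented implementation-defined behavior for the out-of-range unsigned-to-signed conversion), not an excerpt from any patch:

```c
#include <assert.h>
#include <stdint.h>

/* Model of RV64 addw: compute a 32-bit sum, then implicitly
   sign-extend it into the full 64-bit destination register.  This is
   the behavior the RTL above captures as
     (set (temp:DI) (sign_extend:DI (plus:SI (op:SI) (op:SI))))  */
static int64_t
addw_model (int64_t a, int64_t b)
{
  uint32_t sum32 = (uint32_t) a + (uint32_t) b;  /* 32-bit modular add */
  return (int64_t) (int32_t) sum32;              /* sign-extend to DImode */
}
```

The interesting case is overflow into bit 31: the low 32-bit result is reinterpreted as signed and extended, so `addw_model (0x7fffffff, 1)` yields a negative 64-bit value, exactly what a separate SImode pseudo would have to track.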
On Mon, Jun 5, 2023 at 8:41 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 6/5/23 00:29, Richard Biener wrote:
>
> > But then for example x86 has smaller encoding for byte ops and while
> > widening is easily done later, truncation is not.
>
> Sadly, the x86 costing looks totally bogus here.  We actually emit the
> exact same code for a QImode load vs a zero-extending load from QI to
> SI.  But the costing is different and would tend to prefer QImode.
> That in turn is going to force an extension at the end of the
> sequence, which would be a regression relative to the current code.
> Additionally we may get partial register stalls for the byte ops that
> implement the comparison steps.
>
> The net result is that querying the backend's costs would do the
> exact opposite of what I think we want on x86.  One could argue the
> x86 maintainers should improve this situation...
>
> > Note I would have expected to use the mode of the load so we truly
> > elide some extensions, using word_mode looks like just another
> > mode here?  The key to note is probably
> >
> >   op0 = convert_modes (mode, unit_mode, op0, 1);
> >   op1 = convert_modes (mode, unit_mode, op1, 1);
> >   rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
> >                                   result, 1, OPTAB_WIDEN);
> >
> > which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> > unconverted modes and leave the decision which mode to use
> > to OPTAB_WIDEN?  Should we somehow query the target for
> > the smallest mode from unit_mode it can do both the MINUS
> > and the compare?
>
> And avoiding OPTAB_WIDEN isn't going to help rv64 at all.  The core
> issue being that we do define 32-bit ops.  With Jivan's patch those
> 32-bit ops expose their sign-extending nature.  So a 32-bit add would
> look something like
>
>   (set (temp:DI) (sign_extend:DI (plus:SI (op:SI) (op:SI))))
>   (set (res:SI) (subreg:SI (temp:DI) 0))
>
> where we mark the subreg with SUBREG_PROMOTED_VAR_P.
>
> I'm not sure the best way to proceed now.  I could just put this on
> the back-burner as it's RISC-V specific and the gains elsewhere dwarf
> this issue.

I wonder if there's some more generic target macro we can key the
behavior off - SLOW_BYTE_ACCESS isn't a good fit,
WORD_REGISTER_OPERATIONS is maybe closer but its exact implications
are unknown to me.  Maybe there's something else as well ...

The point about OPTAB_WIDEN above was that I wonder why we extend
'op0' and 'op1' before emitting the binop when we allow WIDEN anyway.
Yes, we want the result in 'mode' (but why?  As you say we can extend
at the end) and there's likely no way to tell expand_simple_binop to
"expand as needed and not narrow the result".  So I wonder if we
should emulate that somehow (also taking into consideration the
compare).

Richard.

> jeff
On 6/6/23 00:47, Richard Biener wrote:

> I wonder if there's some more generic target macro we can key the
> behavior off - SLOW_BYTE_ACCESS isn't a good fit,
> WORD_REGISTER_OPERATIONS is maybe closer but its exact implications
> are unknown to me.  Maybe there's something else as well ...

LOAD_EXTEND_OP might help here, at least on some targets.  Though not
on x86.

> The point about OPTAB_WIDEN above was that I wonder why we
> extend 'op0' and 'op1' before emitting the binop when we allow WIDEN
> anyway.

Ahh.  I misunderstood.  However, I think dropping the pre-widening
will result in byte ops on x86, which may not be wise given the
partial register stall problem that exists on some variants.

> Yes, we want the result in 'mode' (but why?  As you say we
> can extend at the end) and there's likely no way to tell
> expand_simple_binop to "expand as needed and not narrow the result".
> So I wonder if we should emulate that somehow (also taking into
> consideration the compare).

That's what I felt I was starting to build.  Essentially looking at
costing (and probably other stuff eventually, like the ability to
compare/branch on narrower modes) to make a determination about
whether or not to do the operations in narrow or wider modes.

With the costing so mucked up on x86 though, I'm hesitant to pursue
this path further at this time.

Jeff
diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 8400adaf5b4..f2e0d3b7d7f 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -7135,6 +7135,9 @@ inline_string_cmp (rtx target, tree var_str, const char *const_str,
   scalar_int_mode unit_mode
     = as_a <scalar_int_mode> TYPE_MODE (unit_type_node);
 
+  /* We do the intermediate steps in WORD_MODE, then convert
+     to mode at the end of the sequence.  */
+  rtx intermediate_result = gen_reg_rtx (word_mode);
   start_sequence ();
 
   for (unsigned HOST_WIDE_INT i = 0; i < length; i++)
@@ -7145,25 +7148,27 @@ inline_string_cmp (rtx target, tree var_str, const char *const_str,
       rtx op0 = (const_str_n == 1) ? const_rtx : var_rtx;
       rtx op1 = (const_str_n == 1) ? var_rtx : const_rtx;
 
-      op0 = convert_modes (mode, unit_mode, op0, 1);
-      op1 = convert_modes (mode, unit_mode, op1, 1);
-      rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
-				      result, 1, OPTAB_WIDEN);
+      op0 = convert_modes (word_mode, unit_mode, op0, 1);
+      op1 = convert_modes (word_mode, unit_mode, op1, 1);
+      rtx diff = expand_simple_binop (word_mode, MINUS, op0, op1,
+				      intermediate_result, 1, OPTAB_WIDEN);
 
-      /* Force the difference into result register.  We cannot reassign
-	 result here ("result = diff") or we may end up returning
-	 uninitialized result when expand_simple_binop allocates a new
-	 pseudo-register for returning.  */
-      if (diff != result)
-	emit_move_insn (result, diff);
+      /* Force the difference into intermediate result register.  We cannot
+	 reassign the intermediate result here ("intermediate_result = diff")
+	 or we may end up returning uninitialized result when
+	 expand_simple_binop allocates a new pseudo-register for returning.  */
+      if (diff != intermediate_result)
+	emit_move_insn (intermediate_result, diff);
 
       if (i < length - 1)
-	emit_cmp_and_jump_insns (result, CONST0_RTX (mode), NE, NULL_RTX,
-				 mode, true, ne_label);
+	emit_cmp_and_jump_insns (intermediate_result, CONST0_RTX (word_mode),
+				 NE, NULL_RTX, word_mode, true, ne_label);
       offset += GET_MODE_SIZE (unit_mode);
     }
 
   emit_label (ne_label);
+  /* Now convert the intermediate result to the final result.  */
+  emit_move_insn (result, gen_lowpart (mode, intermediate_result));
   rtx_insn *insns = get_insns ();
   end_sequence ();
   emit_insn (insns);
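The control flow the patch emits can be sketched at the C level. This is an illustrative model with hypothetical names (not GCC source): every per-character difference lives in a word-sized intermediate, standing in for `word_mode` / `intermediate_result`, and the single narrowing to the `int`-typed result happens exactly once, after the `ne_label`.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the patched inline strcmp expansion: subtract and compare
   in a word-sized type, narrow to int once at the end.  `long` stands
   in for word_mode on an LP64 target.  */
static int
inline_strcmp_word_model (const unsigned char *x, const char *cst)
{
  size_t len = strlen (cst) + 1;      /* compare the NUL as well */
  long diff = 0;                      /* the word_mode intermediate */
  for (size_t i = 0; i < len; i++)
    {
      diff = (long) x[i] - (long) (unsigned char) cst[i];
      if (diff != 0)                  /* emit_cmp_and_jump_insns */
        break;
    }
  return (int) diff;                  /* single gen_lowpart at ne_label */
}
```

Because every intermediate step already produces a word-sized value, no per-character copy into a separate narrow register is needed, which is what eliminates the `mv` in the RISC-V assembly shown earlier in the thread.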