Message ID | 00e701d8862c$bb2d2290$318767b0$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | [x86] PR target/105930: Split *xordi3_doubleword after reload. | expand |
On Wed, Jun 22, 2022 at 1:39 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This patch addresses PR target/105930 which is an ia32 stack frame size > regression in high-register pressure XOR-rich cryptography functions > reported by Linus Torvalds. The underlying problem is once the limited > number of registers on the x86 are exhausted, the register allocator > has to decide which to spill, where some eviction choices lead to much > poorer code, but these consequences are difficult to predict in advance. > > The patch below, which splits xordi3_doubleword and iordi3_doubleword > after reload (instead of before), significantly reduces the amount of > spill code and stack frame size, in what might appear to be an arbitrary > choice. > > My explanation of this behaviour is that the mixing of pre-reload split > SImode instructions and post-reload split DImode instructions is > confusing some of the heuristics used by reload. One might think > that splitting early gives the register allocator more freedom to > use available registers, but in practice the constraint that double > word values occupy consecutive registers (when ultimately used as a > DImode value) is the greater constraint. Instead, I believe in this > case, the pseudo registers used in memory addressing, appear to be > double counted for split SImode instructions compared to unsplit > DImode instructions. For the reduced test case in comment #13, this > leads to %eax being used to hold the long-lived argument pointer "v", > blocking the use of the ax:dx pair for processing double word values. > The important lines are at the very top of the assembly output: > > GCC 11 [use %ecx to address memory, require a 24-byte stack frame] > sub esp, 24 > mov ecx, DWORD PTR [esp+40] > > GCC 12 [use %eax to address memory, require a 44-byte stack frame] > sub esp, 44 > mov eax, DWORD PTR [esp+64] The pre-reload vs. post-reload approach is a constant source of various regressions, since one or the other has its benefits and drawbacks. Please note that arithmetic and shift instructions remain on a post-reload approach, since there were various regressions w.r.t. register allocator observed. I think that the best approach for now would be to move *all* double-word splitters to the post-reload approach. As you exposed above, mixing pre-reload and post-reload approaches confuses reload too much due to the blocking of the part of a register pair. > Jakub's alternative proposed patch in comment #17 is to improve > consistency by splitting more instructions (rotates) before reload, > which shows a small improvement, but not to GCC v11 levels. So, I guess the experiment shows that we should revert to post-reload splitters, while keeping the improvements splitters gain from post-reload implementation. > I have some follow-up patches (based on other approaches I've tried), > including splitting rotations by 32 after reload, and supporting > TImode operations via <dwi>, notice this same pattern is mentioned in > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html > but this patch below is the core minimal fix that's hopefully > suitable for benchmarking and possibly backporting to the 12 branch. > I believe that changes to the register allocator itself, to tweak how > stack slots are assigned and which values can be cheaply materialized > are out-of-scope for the release branches. I assume that and/andnot should also be moved to a post-reload split approach. > > > I'm curious what folks (especially Uros and Jakub) think? > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32}, > with no new failures. CSiBE also shows a (minor) code size reduction. > It would be great if someone could benchmark this patch, or alternatively > it can be baked on mainline to let the automatic benchmarking evaluate it, > then revert the patch if there are any observed performance issues. > > Thoughts? > > > 2022-06-22 Roger Sayle <roger@nextmovesoftware.com> > > gcc/ChangeLog > PR target/105930 > * config/i386/i386.md (*<any_or>di3_doubleword): Split after > reload. Use rtx_equal_p to avoid creating memory-to-memory moves, > and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0). > (define_insn <any_or><mode>_1): Renamed from *<any_or>mode_1 > to provide gen_<any_or>si_1 functions. Is there a reason to not use ix86_expand_{binary|unary}_operator? It fits perfectly here. Otherwise, the patch looks OK to me. Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 3093cb5..537fba21 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10539,48 +10539,61 @@ "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;") (define_insn_and_split "*<code>di3_doubleword" - [(set (match_operand:DI 0 "nonimmediate_operand") + [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r") (any_or:DI - (match_operand:DI 1 "nonimmediate_operand") - (match_operand:DI 2 "x86_64_szext_general_operand"))) + (match_operand:DI 1 "nonimmediate_operand" "0,0") + (match_operand:DI 2 "x86_64_szext_general_operand" "re,o"))) (clobber (reg:CC FLAGS_REG))] "!TARGET_64BIT - && ix86_binary_operator_ok (<CODE>, DImode, operands) - && ix86_pre_reload_split ()" + && ix86_binary_operator_ok (<CODE>, DImode, operands)" "#" - "&& 1" + "&& reload_completed" [(const_int 0)] { + /* This insn may disappear completely when operands[2] == const0_rtx + and operands[0] == operands[1], which requires a NOTE_INSN_DELETED. */ + bool emit_insn_deleted_note_p = false; + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); if (operands[2] == const0_rtx) - emit_move_insn (operands[0], operands[1]); + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + else + emit_insn_deleted_note_p = true; + } else if (operands[2] == constm1_rtx) { if (<CODE> == IOR) emit_move_insn (operands[0], constm1_rtx); else - ix86_expand_unary_operator (NOT, SImode, &operands[0]); + emit_insn (gen_one_cmplsi2 (operands[0], operands[1])); } else - ix86_expand_binary_operator (<CODE>, SImode, &operands[0]); + emit_insn (gen_<code>si_1 (operands[0], operands[1], operands[2])); if (operands[5] == const0_rtx) - emit_move_insn (operands[3], operands[4]); + { + if (!rtx_equal_p (operands[3], operands[4])) + emit_move_insn (operands[3], operands[4]); + else if (emit_insn_deleted_note_p) + emit_note (NOTE_INSN_DELETED); + } else if (operands[5] == constm1_rtx) { if (<CODE> == IOR) emit_move_insn (operands[3], constm1_rtx); else - ix86_expand_unary_operator (NOT, SImode, &operands[3]); + emit_insn (gen_one_cmplsi2 (operands[3], operands[4])); } else - ix86_expand_binary_operator (<CODE>, SImode, &operands[3]); + emit_insn (gen_<code>si_1 (operands[3], operands[4], operands[5])); DONE; }) -(define_insn "*<code><mode>_1" +(define_insn "<code><mode>_1" [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k") (any_or:SWI248 (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")