Message ID | 022001da4819$20d652b0$6282f810$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | PR rtl-optimization/111267: Improved forward propagation. | expand |
On Tue, Jan 16, 2024 at 2:13 AM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This patch resolves PR rtl-optimization/111267 by improving RTL-level > forward propagation. This x86_64 code quality regression was caused > (exposed) by my changes to improve how x86's (TImode) argument passing > is represented at the RTL-level (reducing the use of SUBREGs to catch > more optimization opportunities in combine). The pitfall is that the > more complex RTL representations expose a limitation in RTL's fwprop > pass. > > At the heart of fwprop, in try_fwprop_subst_pattern, the logic can > be summarized as three steps. Step 1 is a heuristic that rejects the > propagation attempt if the expression is too complex, step 2 calls > the backend's recog to see if the propagated/simplified instruction > is recognizable/valid, and step 3 then calls src_cost to compare > the rtx costs of the replacement vs. the original, and accepts the > transformation if the final cost is the same of better. > > The logic error (or missed optimization opportunity) is that the > step 1 heuristic that attempts to predict (second guess) the > process is flawed. Ultimately the decision on whether to fwprop > or not should depend solely on actual improvement, as measured > by RTX costs. Hence the prototype fix in the bugzilla PR removes > the heuristic of calling prop.profitable_p entirely, relying > entirely on the cost comparison in step 3. > > Unfortunately, things are a tiny bit more complicated. The cost > comparison in fwprop uses the older set_src_cost API and not the > newer (preffered) insn_cost API as currently used in combine. > This means that the cost improvement comparisons are only done > for single_set instructions (more complex PARALLELs etc. aren't > supported). Hence we can only rely on skipping step 1 for that > subset of instructions actually evaluated by step 3. > > The other subtlety is that to avoid potential infinite loops > in fwprop we should only reply purely on rtx costs when the > transformation is obviously an improvement. If the replacement > has the same cost as the original, we can use the prop.profitable_p > test to preserve the current behavior. > > Finally, to answer Richard Biener's remaining question about this > approach: yes, there is an asymmetry between how patterns are > handled and how REG_EQUAL notes are handled. For example, at > the moment propagation into notes doesn't use rtx costs at all, > and ultimately when fwprop is updated to use insn_cost, this > (and recog) obviously isn't applicable to notes. There's no reason > the logic need be identical between patterns and notes, and during > stage4 we only need update propagation into patterns to fix this > P1 regression (notes and use of cost_insn can be done for GCC 15). > > For Jakub's reduced testcase: > > struct S { float a, b, c, d; }; > int bar (struct S x, struct S y) { > return x.b <= y.d && x.c >= y.a; > } > > On x86_64-pc-linux-gnu with -O2 gcc currently generates: > > bar: movq %xmm2, %rdx > movq %xmm3, %rax > movq %xmm0, %rsi > xchgq %rdx, %rax > movq %rsi, %rcx > movq %rax, %rsi > movq %rdx, %rax > shrq $32, %rcx > shrq $32, %rax > movd %ecx, %xmm4 > movd %eax, %xmm0 > comiss %xmm4, %xmm0 > jb .L6 > movd %esi, %xmm0 > xorl %eax, %eax > comiss %xmm0, %xmm1 > setnb %al > ret > .L6: xorl %eax, %eax > ret > > with this simple patch to fwprop, we now generate: > > bar: shufps $85, %xmm0, %xmm0 > shufps $85, %xmm3, %xmm3 > comiss %xmm0, %xmm3 > jb .L6 > xorl %eax, %eax > comiss %xmm2, %xmm1 > setnb %al > ret > .L6: xorl %eax, %eax > ret > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Additionally, it also resolves the FAIL for > gcc.target/i386/pr82580.c. Ok for mainline? Thanks for the detailed explanation. I think this deserves a comment about the later costing before - if (!prop.profitable_p ()) + if (!prop.profitable_p () + && (prop.changed_mem_p () + || use_insn->is_asm () + || !single_set (use_rtl))) (and generally the profitable_p method should better be called sth like likely_profitable_p?) I'd say OK with the additional comment added but please give Richard S. a chance to comment. Thanks, Richard. > > 2024-01-16 Roger Sayle <roger@nextmovesoftware.com> > > gcc/ChangeLog > PR rtl-optimization/111267 > * fwprop.cc (try_fwprop_subst_pattern): Only bail-out early when > !prop.profitable_p for instructions that are not single sets. > When comparing costs, bail-out if the cost is unchanged and > !prop.profitable_p. > > gcc/testsuite/ChangeLog > PR rtl-optimization/111267 > * gcc.target/i386/pr111267.c: New test case. > > > Thanks in advance (and to Jeff Law for his guidance/help), > Roger > -- >
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc index 0c588f8..f06225a 100644 --- a/gcc/fwprop.cc +++ b/gcc/fwprop.cc @@ -449,7 +449,10 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change, if (prop.num_replacements == 0) return false; - if (!prop.profitable_p ()) + if (!prop.profitable_p () + && (prop.changed_mem_p () + || use_insn->is_asm () + || !single_set (use_rtl))) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "cannot propagate from insn %d into" @@ -481,7 +484,8 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change, redo_changes (0); auto new_cost = set_src_cost (SET_SRC (use_set), GET_MODE (SET_DEST (use_set)), speed); - if (new_cost > old_cost) + if (new_cost > old_cost + || (new_cost == old_cost && !prop.profitable_p ())) { if (dump_file) fprintf (dump_file, "change not profitable" diff --git a/gcc/testsuite/gcc.target/i386/pr111267.c b/gcc/testsuite/gcc.target/i386/pr111267.c new file mode 100644 index 0000000..e3d549d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr111267.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ +struct S { float a, b, c, d; }; + +int +bar (struct S x, struct S y) +{ + return x.b <= y.d && x.c >= y.a; +} + +/* { dg-final { scan-assembler-not "movq" } } */ +/* { dg-final { scan-assembler-not "xchgq" } } */ +/* { dg-final { scan-assembler-not "shrq" } } */ +/* { dg-final { scan-assembler-not "movd" } } */