Message ID | 20231212111412.29351-1-xujiahao@loongson.cn |
---|---|
State | New |
Series | [v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT. |
On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:
> Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, for a short-circuit branch, use the
> short-circuit operation instead of the non-short-circuit operation.
>
> This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.

In r14-15 we removed the LOGICAL_OP_NON_SHORT_CIRCUIT definition because
the default value (1 for all current LoongArch CPUs with branch_cost = 6)
may reduce the number of conditional branch instructions.

I guess the problem here is that the floating-point compare instruction
is much more costly than other instructions, but this fact is not
correctly modeled yet.  Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised the fp_add cost (which is used for estimating the
floating-point compare cost) to 5 instructions and see if it solves your
problem without LOGICAL_OP_NON_SHORT_CIRCUIT?

If not, I guess you can try increasing the floating-point comparison cost
further in loongarch_rtx_costs:

    case UNLT:
      /* Branch comparisons have VOIDmode, so use the first operand's
         mode instead.  */
      mode = GET_MODE (XEXP (x, 0));
      if (FLOAT_MODE_P (mode))
        {
          *total = loongarch_cost->fp_add;

Try to make it fp_add + something?

          return false;
        }
      *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
                                      speed);
      return true;

If adjusting the cost model does not work, I'd say this is a middle-end
issue and we should submit a bug report.

> gcc/ChangeLog:
>
>     * config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
>
> gcc/testsuite/ChangeLog:
>
>     * gcc.target/loongarch/short-circuit.c: New test.
>
> diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
> index f1350b6048f..880c576c35b 100644
> --- a/gcc/config/loongarch/loongarch.h
> +++ b/gcc/config/loongarch/loongarch.h
> @@ -869,6 +869,7 @@ typedef struct {
>     1 is the default; other values are interpreted relative to that.  */
>
>  #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
> +#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
>
>  /* Return the asm template for a conditional branch instruction.
>     OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
> diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
> new file mode 100644
> index 00000000000..bed585ee172
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
> +
> +int
> +short_circuit (float *a)
> +{
> +  float t1x = a[0];
> +  float t2x = a[1];
> +  float t1y = a[2];
> +  float t2y = a[3];
> +  float t1z = a[4];
> +  float t2z = a[5];
> +
> +  if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
> +    return 0;
> +
> +  return 1;
> +}
> +/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */
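The remark that the default value is 1 when branch_cost = 6 can be illustrated with a small model. The sketch below assumes that a target which leaves LOGICAL_OP_NON_SHORT_CIRCUIT undefined falls back to a comparison of BRANCH_COST against 2 in GCC's generic code; that threshold is not stated in this thread, and the helper name is hypothetical.

```c
/* Hypothetical stand-alone model of the fallback used when a target does
   not define LOGICAL_OP_NON_SHORT_CIRCUIT.  Assumption: the generic
   default is "BRANCH_COST (...) >= 2".  The real macro lives in the GCC
   middle end and takes more context than a plain integer.  */
int
default_logical_op_non_short_circuit (int branch_cost)
{
  return branch_cost >= 2;
}
```

Under that assumption, a branch cost of 6 yields 1, which matches the behaviour described for current LoongArch CPUs, while a branch cost of 1 would yield 0.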
On 2023/12/12 at 7:26 PM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:
>> Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, for a short-circuit branch, use the
>> short-circuit operation instead of the non-short-circuit operation.
>>
>> This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.
> In r14-15 we removed LOGICAL_OP_NON_SHORT_CIRCUIT definition because the
> default value (1 for all current LoongArch CPUs with branch_cost = 6)
> may reduce the number of conditional branch instructions.
>
> I guess here the problem is floating-point compare instruction is much
> more costly than other instructions but the fact is not correctly
> modeled yet.  Could you try
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> where I've raised fp_add cost (which is used for estimating
> floating-point compare cost) to 5 instructions and see if it solves
> your problem without LOGICAL_OP_NON_SHORT_CIRCUIT?

I think this is not the same issue as the cost of floating-point
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
affects how a short-circuit branch, such as (A AND-IF B), is executed,
and it is not directly related to the cost of floating-point comparison
instructions. I will try to test it using SPECCPU 2017.

> If not I guess you can try increasing the floating-point comparison cost
> more in loongarch_rtx_costs:
>
>     case UNLT:
>       /* Branch comparisons have VOIDmode, so use the first operand's
>          mode instead.  */
>       mode = GET_MODE (XEXP (x, 0));
>       if (FLOAT_MODE_P (mode))
>         {
>           *total = loongarch_cost->fp_add;
>
> Try to make it fp_add + something?
>
>           return false;
>         }
>       *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
>                                       speed);
>       return true;
>
> If adjusting the cost model does not work I'd say this is a middle-end
> issue and we should submit a bug report.
>
>> gcc/ChangeLog:
>>
>>     * config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
>>
>> gcc/testsuite/ChangeLog:
>>
>>     * gcc.target/loongarch/short-circuit.c: New test.
>>
>> diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
>> index f1350b6048f..880c576c35b 100644
>> --- a/gcc/config/loongarch/loongarch.h
>> +++ b/gcc/config/loongarch/loongarch.h
>> @@ -869,6 +869,7 @@ typedef struct {
>>     1 is the default; other values are interpreted relative to that.  */
>>
>>  #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
>> +#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
>>
>>  /* Return the asm template for a conditional branch instruction.
>>     OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
>> diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
>> new file mode 100644
>> index 00000000000..bed585ee172
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
>> @@ -0,0 +1,19 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
>> +
>> +int
>> +short_circuit (float *a)
>> +{
>> +  float t1x = a[0];
>> +  float t2x = a[1];
>> +  float t1y = a[2];
>> +  float t2y = a[3];
>> +  float t1z = a[4];
>> +  float t2z = a[5];
>> +
>> +  if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
>> +    return 0;
>> +
>> +  return 1;
>> +}
>> +/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */
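To make the (A AND-IF B) point concrete, here is a minimal sketch in C of the two shapes being discussed. The function names are hypothetical and the code only models the idea at source level; it is not the GIMPLE that GCC actually emits.

```c
/* Short-circuit shape (what LOGICAL_OP_NON_SHORT_CIRCUIT == 0 asks the
   middle end to keep): the second comparison is evaluated only if the
   first one holds, at the price of one conditional branch per
   comparison.  */
int
and_if_short_circuit (float a, float b, float c, float d)
{
  if (a < b)
    if (c < d)
      return 1;
  return 0;
}

/* Non-short-circuit shape (what the middle end may produce when the
   macro is nonzero and both operands are cheap and free of side
   effects): both comparisons are always evaluated and combined with a
   bitwise AND, leaving at most one branch.  */
int
and_if_non_short_circuit (float a, float b, float c, float d)
{
  int first = a < b;
  int second = c < d;
  return first & second;
}
```

Both functions return the same value for every input; they differ only in how many comparisons and branches are executed, which is where the disagreement about cost modelling in this thread comes from.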
On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > I guess here the problem is floating-point compare instruction is much
> > more costly than other instructions but the fact is not correctly
> > modeled yet.  Could you try
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > where I've raised fp_add cost (which is used for estimating
> > floating-point compare cost) to 5 instructions and see if it solves
> > your problem without LOGICAL_OP_NON_SHORT_CIRCUIT?
> I think this is not the same issue as the cost of floating-point
> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
> affects how the short-circuit branch, such as (A AND-IF B), is executed,
> and it is not directly related to the cost of floating-point comparison
> instructions. I will try to test it using SPECCPU 2017.

The point is that if the cost of a floating-point comparison is very
high, the middle end *should* short-circuit floating-point comparisons
even if LOGICAL_OP_NON_SHORT_CIRCUIT = 1.

I've created https://gcc.gnu.org/PR112985.

Another factor regressing the code is that we have not modeled the
movcf2gr instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > > I guess here the problem is floating-point compare instruction is much
> > > more costly than other instructions but the fact is not correctly
> > > modeled yet.  Could you try
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > > where I've raised fp_add cost (which is used for estimating
> > > floating-point compare cost) to 5 instructions and see if it solves
> > > your problem without LOGICAL_OP_NON_SHORT_CIRCUIT?
> > I think this is not the same issue as the cost of floating-point
> > comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
> > affects how the short-circuit branch, such as (A AND-IF B), is executed,
> > and it is not directly related to the cost of floating-point comparison
> > instructions. I will try to test it using SPECCPU 2017.
>
> The point is if the cost of floating-point comparison is very high, the
> middle end *should* short cut floating-point comparisons even if
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>
> I've created https://gcc.gnu.org/PR112985.
>
> Another factor regressing the code is we don't have modeled movcf2gr
> instruction yet, so we are not really eliding the branches as
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 supposes to do.

I made up this:

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
   [(set_attr "type" "fcmp")
    (set_attr "mode" "FCC")])
 
+(define_insn "movcf2gr<GPR:mode>"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+        (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+                              (const_int 0))
+                          (const_int 1)
+                          (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore<ANYF:mode>4"
+  [(set (match_operand:SI 0 "register_operand")
+        (match_operator:SI 1 "loongarch_fcmp_operator"
+          [(match_operand:ANYF 2 "register_operand")
+           (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+    rtx fcc = gen_reg_rtx (FCCmode);
+    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+                              operands[2], operands[3]);
+
+    emit_insn (gen_rtx_SET (fcc, cmp));
+    if (TARGET_64BIT)
+      {
+        rtx gpr = gen_reg_rtx (DImode);
+        emit_insn (gen_movcf2grdi (gpr, fcc));
+        emit_insn (gen_rtx_SET (operands[0],
+                                lowpart_subreg (SImode, gpr, DImode)));
+      }
+    else
+      emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+    DONE;
+  })
+
 
 ;;
 ;; ....................
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
 (define_predicate "loongarch_cstore_operator"
   (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
 
+(define_predicate "loongarch_fcmp_operator"
+  (match_code
+    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
 (define_predicate "small_data_pattern"
   (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
        (match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):

        fld.s $f1,$r4,0
        fld.s $f0,$r4,4
        fld.s $f3,$r4,8
        fld.s $f2,$r4,12
        fcmp.slt.s $fcc1,$f0,$f3
        fcmp.sgt.s $fcc0,$f1,$f2
        movcf2gr $r13,$fcc1
        movcf2gr $r12,$fcc0
        or $r12,$r12,$r13
        bnez $r12,.L3
        fld.s $f4,$r4,16
        fld.s $f5,$r4,20
        or $r4,$r0,$r0
        fcmp.sgt.s $fcc1,$f1,$f5
        fcmp.slt.s $fcc0,$f0,$f4
        movcf2gr $r12,$fcc1
        movcf2gr $r13,$fcc0
        or $r12,$r12,$r13
        bnez $r12,.L2
        fcmp.sgt.s $fcc1,$f3,$f5
        fcmp.slt.s $fcc0,$f2,$f4
        movcf2gr $r4,$fcc1
        movcf2gr $r12,$fcc0
        or $r4,$r4,$r12
        xori $r4,$r4,1
        slli.w $r4,$r4,0
        jr $r1
        .align 4
.L3:
        or $r4,$r0,$r0
        .align 4
.L2:
        jr $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle it via
the ext_dce pass [1] in the future.

[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
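For readers who do not read LoongArch assembly, the listing can be paraphrased in C. The sketch below is a hand-written reading of that listing, not compiler output: `short_circuit` repeats the test-case function (with a `const` qualifier added), and `short_circuit_paired` is a hypothetical name for the shape of the generated code, where comparisons are evaluated in pairs, combined with a bitwise OR, and followed by a single branch per pair.

```c
/* The function from gcc.target/loongarch/short-circuit.c, kept as the
   reference behaviour.  */
int
short_circuit (const float *a)
{
  float t1x = a[0];
  float t2x = a[1];
  float t1y = a[2];
  float t2y = a[3];
  float t1z = a[4];
  float t2z = a[5];

  if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
    return 0;

  return 1;
}

/* Hypothetical C rendering of the assembly: each pair of fcmp results is
   moved to general registers (movcf2gr), ORed together, and tested by
   one bnez.  The last pair needs no branch at all (or + xori).  */
int
short_circuit_paired (const float *a)
{
  float t1x = a[0], t2x = a[1], t1y = a[2], t2y = a[3];
  if ((t2x < t1y) | (t1x > t2y))
    return 0;

  float t1z = a[4], t2z = a[5];
  if ((t1x > t2z) | (t2x < t1z))
    return 0;

  return !((t1y > t2z) | (t2y < t1z));
}
```

A micro-benchmark of the kind mentioned in the message could call both variants on randomly filled arrays; the two functions always agree on the result, so only the timing differs.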
On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
>
>         fld.s $f1,$r4,0
>         fld.s $f0,$r4,4
>         fld.s $f3,$r4,8
>         fld.s $f2,$r4,12
>         fcmp.slt.s $fcc1,$f0,$f3
>         fcmp.sgt.s $fcc0,$f1,$f2
>         movcf2gr $r13,$fcc1
>         movcf2gr $r12,$fcc0
>         or $r12,$r12,$r13
>         bnez $r12,.L3
>         fld.s $f4,$r4,16
>         fld.s $f5,$r4,20
>         or $r4,$r0,$r0
>         fcmp.sgt.s $fcc1,$f1,$f5
>         fcmp.slt.s $fcc0,$f0,$f4
>         movcf2gr $r12,$fcc1
>         movcf2gr $r13,$fcc0
>         or $r12,$r12,$r13
>         bnez $r12,.L2
>         fcmp.sgt.s $fcc1,$f3,$f5
>         fcmp.slt.s $fcc0,$f2,$f4
>         movcf2gr $r4,$fcc1
>         movcf2gr $r12,$fcc0
>         or $r4,$r4,$r12
>         xori $r4,$r4,1
>         slli.w $r4,$r4,0
>         jr $r1
>         .align 4
> .L3:
>         or $r4,$r0,$r0
>         .align 4
> .L2:
>         jr $r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle via
> the ext_dce pass [1] in the future.

The attached patches can remove the remaining sign-extension
instructions from the assembly.

> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>
>         fld.s $f1,$r4,0
>         fld.s $f0,$r4,4
>         fld.s $f3,$r4,8
>         fld.s $f2,$r4,12
>         fcmp.slt.s $fcc1,$f0,$f3
>         fcmp.sgt.s $fcc0,$f1,$f2
>         movcf2gr $r13,$fcc1
>         movcf2gr $r12,$fcc0

There is also a problem: on 3A5000, MOVCF2GR requires 7 cycles, while
MOVCF2FR+MOVFR2GR takes a cycle. 3A6000 has no such problem.

>         or $r12,$r12,$r13
>         bnez $r12,.L3
>         fld.s $f4,$r4,16
>         fld.s $f5,$r4,20
>         or $r4,$r0,$r0
>         fcmp.sgt.s $fcc1,$f1,$f5
>         fcmp.slt.s $fcc0,$f0,$f4
>         movcf2gr $r12,$fcc1
>         movcf2gr $r13,$fcc0
>         or $r12,$r12,$r13
>         bnez $r12,.L2
>         fcmp.sgt.s $fcc1,$f3,$f5
>         fcmp.slt.s $fcc0,$f2,$f4
>         movcf2gr $r4,$fcc1
>         movcf2gr $r12,$fcc0
>         or $r4,$r4,$r12
>         xori $r4,$r4,1
>         slli.w $r4,$r4,0
>         jr $r1
>         .align 4
> .L3:
>         or $r4,$r0,$r0
>         .align 4
> .L2:
>         jr $r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
>>>> I guess here the problem is floating-point compare instruction is much
>>>> more costly than other instructions but the fact is not correctly
>>>> modeled yet.  Could you try
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
>>>> where I've raised fp_add cost (which is used for estimating
>>>> floating-point compare cost) to 5 instructions and see if it solves
>>>> your problem without LOGICAL_OP_NON_SHORT_CIRCUIT?
>>> I think this is not the same issue as the cost of floating-point
>>> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
>>> affects how the short-circuit branch, such as (A AND-IF B), is executed,
>>> and it is not directly related to the cost of floating-point comparison
>>> instructions. I will try to test it using SPECCPU 2017.
>> The point is if the cost of floating-point comparison is very high, the
>> middle end *should* short cut floating-point comparisons even if
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>>
>> I've created https://gcc.gnu.org/PR112985.
>>
>> Another factor regressing the code is we don't have modeled movcf2gr
>> instruction yet, so we are not really eliding the branches as
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 supposes to do.
> I made up this:
>
> diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
> index a5d0dcd65fe..84d828ebd0f 100644
> --- a/gcc/config/loongarch/loongarch.md
> +++ b/gcc/config/loongarch/loongarch.md
> @@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
>    [(set_attr "type" "fcmp")
>     (set_attr "mode" "FCC")])
>
> +(define_insn "movcf2gr<GPR:mode>"
> +  [(set (match_operand:GPR 0 "register_operand" "=r")
> +        (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
> +                              (const_int 0))
> +                          (const_int 1)
> +                          (const_int 0)))]
> +  "TARGET_HARD_FLOAT"
> +  "movcf2gr\t%0,%1"
> +  [(set_attr "type" "move")
> +   (set_attr "mode" "FCC")])
> +
> +(define_expand "cstore<ANYF:mode>4"
> +  [(set (match_operand:SI 0 "register_operand")
> +        (match_operator:SI 1 "loongarch_fcmp_operator"
> +          [(match_operand:ANYF 2 "register_operand")
> +           (match_operand:ANYF 3 "register_operand")]))]
> +  ""
> +  {
> +    rtx fcc = gen_reg_rtx (FCCmode);
> +    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
> +                              operands[2], operands[3]);
> +
> +    emit_insn (gen_rtx_SET (fcc, cmp));
> +    if (TARGET_64BIT)
> +      {
> +        rtx gpr = gen_reg_rtx (DImode);
> +        emit_insn (gen_movcf2grdi (gpr, fcc));
> +        emit_insn (gen_rtx_SET (operands[0],
> +                                lowpart_subreg (SImode, gpr, DImode)));
> +      }
> +    else
> +      emit_insn (gen_movcf2grsi (operands[0], fcc));
> +
> +    DONE;
> +  })
> +
>
>  ;;
>  ;; ....................
> diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
> index 9e9ce58cb53..83fea08315c 100644
> --- a/gcc/config/loongarch/predicates.md
> +++ b/gcc/config/loongarch/predicates.md
> @@ -590,6 +590,10 @@ (define_predicate "order_operator"
>  (define_predicate "loongarch_cstore_operator"
>    (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
>
> +(define_predicate "loongarch_fcmp_operator"
> +  (match_code
> +    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
> +
>  (define_predicate "small_data_pattern"
>    (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
>         (match_test "loongarch_small_data_pattern_p (op)")))
>
> and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
> = 1):
>
>         fld.s $f1,$r4,0
>         fld.s $f0,$r4,4
>         fld.s $f3,$r4,8
>         fld.s $f2,$r4,12
>         fcmp.slt.s $fcc1,$f0,$f3
>         fcmp.sgt.s $fcc0,$f1,$f2
>         movcf2gr $r13,$fcc1
>         movcf2gr $r12,$fcc0
>         or $r12,$r12,$r13
>         bnez $r12,.L3
>         fld.s $f4,$r4,16
>         fld.s $f5,$r4,20
>         or $r4,$r0,$r0
>         fcmp.sgt.s $fcc1,$f1,$f5
>         fcmp.slt.s $fcc0,$f0,$f4
>         movcf2gr $r12,$fcc1
>         movcf2gr $r13,$fcc0
>         or $r12,$r12,$r13
>         bnez $r12,.L2
>         fcmp.sgt.s $fcc1,$f3,$f5
>         fcmp.slt.s $fcc0,$f2,$f4
>         movcf2gr $r4,$fcc1
>         movcf2gr $r12,$fcc0
>         or $r4,$r4,$r12
>         xori $r4,$r4,1
>         slli.w $r4,$r4,0
>         jr $r1
>         .align 4
> .L3:
>         or $r4,$r0,$r0
>         .align 4
> .L2:
>         jr $r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples).  We may be able to handle via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>

This test was extracted from the hot functions of 526.blender_r.
Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in
dynamic instruction count and a 13.4% performance improvement. After
applying the patch mentioned above, the assembly code looks much better
with LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
On top of that, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
improved the performance of 526 by 3%. The definition of
LOGICAL_OP_NON_SHORT_CIRCUIT determines how GIMPLE is generated, while
the optimizations you made determine how RTL is generated. They do not
conflict, and combining them would yield better results. So far I have
only tested this on 526, and I will continue testing its impact on the
entire SPEC 2017 suite.
On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> This test was extracted from the hot functions of 526.blender_r. Setting
> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
> instruction count and a 13.4% performance improvement. After applying
> the patch mentioned above, the assembly code looks much better with
> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
> improved the performance of 526 by 3%. The definition of
> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> the optimizations you made determine how rtl is generated. They are not
> conflicting and combining them would yield better results. Currently, I
> have only tested it on 526, and I will continue testing its impact on
> the entire SPEC 2017 suite.

The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is that it may regress
fixed-point-only code.  In practice the usage of -ffast-math is very
rare ("real" Linux packages invoking floating-point operations often
just malfunction with it), and it does not seem good to regress common
cases for the sake of uncommon ones.
On 2023/12/13 at 2:21 PM, Xi Ruoyao wrote:
> On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
>> This test was extracted from the hot functions of 526.blender_r. Setting
>> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
>> instruction count and a 13.4% performance improvement. After applying
>> the patch mentioned above, the assembly code looks much better with
>> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
>> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
>> improved the performance of 526 by 3%. The definition of
>> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
>> the optimizations you made determine how rtl is generated. They are not
>> conflicting and combining them would yield better results. Currently, I
>> have only tested it on 526, and I will continue testing its impact on
>> the entire SPEC 2017 suite.
> The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> fixed-point only code.  In practice the usage of -ffast-math is very
> rare ("real" Linux packages invoking floating-point operations often
> just malfunction with it) and it seems not good to regress common cases
> with uncommon cases.
>

In the SPEC2017 intrate benchmark, setting LOGICAL_OP_NON_SHORT_CIRCUIT
to 0 results in a 1.6% decrease in dynamic instruction count and an
overall performance improvement of 0.5%. Most of the SPEC2017 int
programs see a decrease in instruction count, and no performance
regressions were observed.
On Wed, 2023-12-13 at 14:32 +0800, Jiahao Xu wrote:
>
> On 2023/12/13 at 2:21 PM, Xi Ruoyao wrote:
> > On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> > > This test was extracted from the hot functions of 526.blender_r. Setting
> > > LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
> > > instruction count and a 13.4% performance improvement. After applying
> > > the patch mentioned above, the assembly code looks much better with
> > > LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
> > > Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
> > > improved the performance of 526 by 3%. The definition of
> > > LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> > > the optimizations you made determine how rtl is generated. They are not
> > > conflicting and combining them would yield better results. Currently, I
> > > have only tested it on 526, and I will continue testing its impact on
> > > the entire SPEC 2017 suite.
> > The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> > fixed-point only code.  In practice the usage of -ffast-math is very
> > rare ("real" Linux packages invoking floating-point operations often
> > just malfunction with it) and it seems not good to regress common cases
> > with uncommon cases.
> >
> Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 in SPEC2017 intrate benchmark
> results in a 1.6% decrease in dynamic instruction count and an overall
> performance improvement of 0.5%. Most of the SPEC2017 int programs
> experience a decrease in instruction count, and there are no instances
> of performance regression observed.

OK then.  But please add this information to the commit message.
diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
index f1350b6048f..880c576c35b 100644
--- a/gcc/config/loongarch/loongarch.h
+++ b/gcc/config/loongarch/loongarch.h
@@ -869,6 +869,7 @@ typedef struct {
    1 is the default; other values are interpreted relative to that.  */
 
 #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
+#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
 
 /* Return the asm template for a conditional branch instruction.
    OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
new file mode 100644
index 00000000000..bed585ee172
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
+
+int
+short_circuit (float *a)
+{
+  float t1x = a[0];
+  float t2x = a[1];
+  float t1y = a[2];
+  float t2y = a[3];
+  float t1z = a[4];
+  float t2z = a[5];
+
+  if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
+    return 0;
+
+  return 1;
+}
+/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */