Message ID | 20240905185257.22411-1-palmer@rivosinc.com |
---|---|
State | New |
Series | RISC-V: Define LOGICAL_OP_NON_SHORT_CIRCUIT to 1 [PR116615] |
On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
> We have cheap logical ops, so let's just move this back to the default
> to take advantage of the standard branch/op heuristics.
>
> gcc/ChangeLog:
>
> 	PR target/116615
> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> ---
> There's a bunch more discussion in the bug, but it's starting to smell
> like this was just a holdover from MIPS (where maybe it also shouldn't
> be set). I haven't tested this, but I figured I'd send the patch to get
> a little more visibility.
>
> I guess we should also kick off something like a SPEC run to make sure
> there's no regressions?

Sorry I missed it in the bug, but Ruoyao points to dddafe94823
("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where short-circuiting
the FP comparisons helps on LoongArch.

Not sure if I'm also missing something here, but it kind of feels like
that should be handled by a more generic optimization decision than just
a global "should we short circuit logical ops" -- assuming it really is
the FP comparisons that are causing the cost, as opposed to the actual
logical ops themselves.

Probably best to actually run the benchmarks, though...

> ---
>  gcc/config/riscv/riscv.h | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
> index ead97867eb8..a0ccd1fc762 100644
> --- a/gcc/config/riscv/riscv.h
> +++ b/gcc/config/riscv/riscv.h
> @@ -939,8 +939,6 @@ extern enum riscv_cc get_riscv_cc (const rtx use);
>  #define TARGET_VECTOR_MISALIGN_SUPPORTED \
>    riscv_vector_unaligned_access_p
>
> -#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
> -
>  /* Control the assembler format that we output. */
>
>  /* Output to assembler file text saying following lines
On Thu, 2024-09-05 at 11:59 -0700, Palmer Dabbelt wrote:
> On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
> > We have cheap logical ops, so let's just move this back to the default
> > to take advantage of the standard branch/op heuristics.
> >
> > gcc/ChangeLog:
> >
> > 	PR target/116615
> > 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> > ---
> > There's a bunch more discussion in the bug, but it's starting to smell
> > like this was just a holdover from MIPS (where maybe it also shouldn't
> > be set). I haven't tested this, but I figured I'd send the patch to get
> > a little more visibility.
> >
> > I guess we should also kick off something like a SPEC run to make sure
> > there's no regressions?
>
> Sorry I missed it in the bug, but Ruoyao points to dddafe94823
> ("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where
> short-circuiting the FP comparisons helps on LoongArch.
>
> Not sure if I'm also missing something here, but it kind of feels like
> that should be handled by a more generic optimization decision than just
> a global "should we short circuit logical ops" -- assuming it really is
> the FP comparisons that are causing the cost, as opposed to the actual
> logical ops themselves.

IIUC there are some contributing factors here:

1. On LoongArch FP comparison is slow (costing 5 cycles).
2. On LoongArch the FP comparison result is stored into FCC registers,
   and to do logical operations on two comparison results they need to
   be moved into GPRs first. The move costs one or two cycles (depending
   on the uarch).

and maybe

3. The FP comparison results in the SPEC tests are somewhat predictable.
   IIRC when I tested dddafe94823 I made a test program where the FP
   comparison results are "randomized" (so the branch predictor is
   defeated), then the branch-less code generated with
   -Ofast --param logical-op-non-short-circuit=1 was actually faster
   than the code generated with
   -Ofast --param logical-op-non-short-circuit=0.

AFAIK 2 isn't an issue for RISC-V (where the FP comparison result is just
in a GPR) but 1 and 3 may still need to be considered.
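For anyone who wants to try an experiment along the lines of point 3, here is a minimal sketch of what such a test could look like. It is not the program used when testing dddafe94823; the function names, the xorshift generator and the seeds are all made up for illustration. It only provides the two ingredients: FP inputs that are hard to predict, and a predicate built from two FP comparisons.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical xorshift32 generator, used only so that the comparison
   outcomes are hard for a branch predictor to guess.  */
static uint32_t
xorshift32 (uint32_t *state)
{
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Fill BUF with N pseudo-random doubles in [0, 1).  */
void
fill_random (double *buf, size_t n, uint32_t seed)
{
  uint32_t state = seed ? seed : 1;	/* xorshift state must be non-zero */
  for (size_t i = 0; i < n; i++)
    buf[i] = xorshift32 (&state) / 4294967296.0;
}

/* Written with &&: the compiler decides whether to branch after the
   first comparison or to evaluate both and combine them.  */
size_t
count_both_below (const double *a, const double *b, size_t n, double limit)
{
  size_t hits = 0;
  for (size_t i = 0; i < n; i++)
    if (a[i] < limit && b[i] < limit)
      hits++;
  return hits;
}

/* Same predicate, but both comparisons are always evaluated and their
   0/1 results are combined with a bitwise AND.  */
size_t
count_both_below_nsc (const double *a, const double *b, size_t n,
		      double limit)
{
  size_t hits = 0;
  for (size_t i = 0; i < n; i++)
    hits += (size_t) ((a[i] < limit) & (b[i] < limit));
  return hits;
}
```

One way to use it, assuming a small driver that times `count_both_below` over a large array: build it once with `-Ofast --param logical-op-non-short-circuit=0` and once with `=1`, and compare. The optimizers may if-convert or vectorize either loop, so it is worth checking the generated assembly before trusting the numbers.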
On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> We have cheap logical ops, so let's just move this back to the default
> to take advantage of the standard branch/op heuristics.
>
> gcc/ChangeLog:
>
> 	PR target/116615
> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> ---
> There's a bunch more discussion in the bug, but it's starting to smell
> like this was just a holdover from MIPS (where maybe it also shouldn't
> be set). I haven't tested this, but I figured I'd send the patch to get
> a little more visibility.
>
> I guess we should also kick off something like a SPEC run to make sure
> there's no regressions?

Yea, I'd definitely want to see some hard data on an implementation for
this. I wouldn't want to rely just on icounts and eyeballing given it's
dependent on branch predictor accuracy and such. BPI is probably the
best platform for this kind of testing right now.

I probably can't spin it this week, but probably could next week.

jeff
On Thu, Sep 5, 2024 at 2:52 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> > We have cheap logical ops, so let's just move this back to the default
> > to take advantage of the standard branch/op heuristics.
> >
> > gcc/ChangeLog:
> >
> > 	PR target/116615
> > 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> > ---
> > There's a bunch more discussion in the bug, but it's starting to smell
> > like this was just a holdover from MIPS (where maybe it also shouldn't
> > be set). I haven't tested this, but I figured I'd send the patch to get
> > a little more visibility.
> >
> > I guess we should also kick off something like a SPEC run to make sure
> > there's no regressions?
> Yea, I'd definitely want to see some hard data on an implementation for
> this. I wouldn't want to rely just on icounts and eyeballing given it's
> dependent on branch predictor accuracy and such. BPI is probably the
> best platform for this kind of testing right now.
>
> I probably can't spin it this week, but probably could next week.

Thanks. If you don't mind, please also collect the static code-size
statistics so we can decide if we need to choose different strategies
when optimizing for size vs. speed.

>
> jeff
>
On 9/5/24 12:59 PM, Palmer Dabbelt wrote:
> On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
>> We have cheap logical ops, so let's just move this back to the default
>> to take advantage of the standard branch/op heuristics.
>>
>> gcc/ChangeLog:
>>
>> 	PR target/116615
>> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
>> ---
>> There's a bunch more discussion in the bug, but it's starting to smell
>> like this was just a holdover from MIPS (where maybe it also shouldn't
>> be set). I haven't tested this, but I figured I'd send the patch to get
>> a little more visibility.
>>
>> I guess we should also kick off something like a SPEC run to make sure
>> there's no regressions?
>
> Sorry I missed it in the bug, but Ruoyao points to dddafe94823
> ("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where
> short-circuiting the FP comparisons helps on LoongArch.
>
> Not sure if I'm also missing something here, but it kind of feels like
> that should be handled by a more generic optimization decision than just
> a global "should we short circuit logical ops" -- assuming it really is
> the FP comparisons that are causing the cost, as opposed to the actual
> logical ops themselves.
>
> Probably best to actually run the benchmarks, though...

The #define essentially is overriding the generic heuristics which look
at branch cost to determine how aggressively to try and combine several
conditional branch conditions using logical ops so they can use a single
conditional branch in the end.

I don't remember all the history here, but in retrospect, the mere
existence of that #define points to a failing in the costing models.

FWIW, my general sense is that the gimple phases shouldn't work *too*
hard to try and combine logical ops, but the if-converters in the RTL
phases should be fairly aggressive. The fact that we use BRANCH_COST
to drive both is likely sub-optimal.

jeff
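To make the override concrete, here is a rough standalone model of the decision. It assumes, as far as I recall, that when a target leaves the macro undefined the generic fallback amounts to roughly "BRANCH_COST >= 2"; that should be checked against the tree before relying on it. `struct target_model`, its fields and the cost values are invented for illustration and are not GCC code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the bits of a target description that
   matter here.  */
struct target_model
{
  int branch_cost;	/* plays the role of BRANCH_COST */
  bool has_override;	/* target #defines LOGICAL_OP_NON_SHORT_CIRCUIT */
  bool override_value;	/* the value of that #define */
};

/* Return true if "a && b" style conditions should be combined with
   logical ops rather than kept as separate short-circuit branches.  */
bool
logical_op_non_short_circuit (const struct target_model *t)
{
  /* A target-level #define wins unconditionally, whatever the branch
     cost says.  */
  if (t->has_override)
    return t->override_value;

  /* Assumed generic fallback: combine when branches are costly.  */
  return t->branch_cost >= 2;
}
```

In this model a target that pins the value to 0 never combines, however expensive its branches are, while removing the override hands the decision back to the branch cost. That is the behavior change the patch is after.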
On Thu, Sep 5, 2024 at 2:57 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 9/5/24 12:59 PM, Palmer Dabbelt wrote:
> > On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
> >> We have cheap logical ops, so let's just move this back to the default
> >> to take advantage of the standard branch/op heuristics.
> >>
> >> gcc/ChangeLog:
> >>
> >> 	PR target/116615
> >> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> >> ---
> >> There's a bunch more discussion in the bug, but it's starting to smell
> >> like this was just a holdover from MIPS (where maybe it also shouldn't
> >> be set). I haven't tested this, but I figured I'd send the patch to get
> >> a little more visibility.
> >>
> >> I guess we should also kick off something like a SPEC run to make sure
> >> there's no regressions?
> >
> > Sorry I missed it in the bug, but Ruoyao points to dddafe94823
> > ("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where
> > short-circuiting the FP comparisons helps on LoongArch.
> >
> > Not sure if I'm also missing something here, but it kind of feels like
> > that should be handled by a more generic optimization decision than just
> > a global "should we short circuit logical ops" -- assuming it really is
> > the FP comparisons that are causing the cost, as opposed to the actual
> > logical ops themselves.
> >
> > Probably best to actually run the benchmarks, though...
> The #define essentially is overriding the generic heuristics which look
> at branch cost to determine how aggressively to try and combine several
> conditional branch conditions using logical ops so they can use a single
> conditional branch in the end.
>
> I don't remember all the history here, but in retrospect, the mere
> existence of that #define points to a failing in the costing models.

I provided the original history of LOGICAL_OP_NON_SHORT_CIRCUIT in the
RISC-V bug report. And yes, there is a costing model failure here.

LOGICAL_OP_NON_SHORT_CIRCUIT is useful if you have a decent cset (or,
these days, a ccmp optab). One cost model issue is that
LOGICAL_OP_NON_SHORT_CIRCUIT does not distinguish whether the comparison
is FP or integer (distinguishing them would handle Loongson and MIPS,
and to a lesser extent RISC-V).

The PowerPC backend does not implement the ccmp optab, nor does it have
a cheap cset, so having it as 0 is correct there even though BRANCH_COST
might be low for the target (it could implement the ccmp optab now, but
nobody has done that yet).

Note RISC-V's cset is cheap (in both size and speed) because, like MIPS,
it just has instructions that set a GPR, which is then compared
against 0.

I don't have time until next year to start looking at improving the
situation with respect to LOGICAL_OP_NON_SHORT_CIRCUIT/BRANCH_COST; it
is on my radar since I want to improve how aarch64's ccmp is done and
move the use of LOGICAL_OP_NON_SHORT_CIRCUIT out of fold-const so that
it is only used in the ifcombine (or maybe even just the isel) pass.

Thanks,
Andrew Pinski

>
> FWIW, my general sense is that the gimple phases shouldn't work *too*
> hard to try and combine logical ops, but the if-converters in the RTL
> phases should be fairly aggressive. The fact that we use BRANCH_COST
> to drive both is likely sub-optimal.
> jeff
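As an illustration of the "cheap cset" point, here is a small sketch of the two source-level shapes involved. The function names are hypothetical, and the comments about instruction selection describe what one would typically expect (slt-style set instructions on RISC-V), not verified compiler output.

```c
/* Short-circuit form: taken literally, this is two conditional
   branches, one per comparison.  */
int
in_range_sc (long x, long lo, long hi)
{
  return x >= lo && x < hi;
}

/* Non-short-circuit form: each comparison becomes a 0/1 value in a
   register (the cset), the two values are ANDed, and a caller needs
   only a single branch that tests the result against zero.  */
int
in_range_nsc (long x, long lo, long hi)
{
  int ge_lo = !(x < lo);	/* expected: a set-less-than plus an invert */
  int lt_hi = x < hi;		/* expected: a single set-less-than */
  return ge_lo & lt_hi;
}
```

Both functions compute the same result; whether the second shape is actually a win depends on how cheap the set instructions are compared with a possibly mispredicted branch, which is exactly the cost-model question in this thread.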
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index ead97867eb8..a0ccd1fc762 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -939,8 +939,6 @@ extern enum riscv_cc get_riscv_cc (const rtx use);
 #define TARGET_VECTOR_MISALIGN_SUPPORTED \
   riscv_vector_unaligned_access_p
 
-#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
-
 /* Control the assembler format that we output. */
 
 /* Output to assembler file text saying following lines