Message ID | 20240905185257.22411-1-palmer@rivosinc.com |
---|---|
State | New |
Series | RISC-V: Define LOGICAL_OP_NON_SHORT_CIRCUIT to 1 [PR116615] |

On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
> We have cheap logical ops, so let's just move this back to the default
> to take advantage of the standard branch/op heuristics.
>
> gcc/ChangeLog:
>
> 	PR target/116615
> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> ---
> There's a bunch more discussion in the bug, but it's starting to smell
> like this was just a holdover from MIPS (where maybe it also shouldn't
> be set). I haven't tested this, but I figured I'd send the patch to get
> a little more visibility.
>
> I guess we should also kick off something like a SPEC run to make sure
> there's no regressions?

Sorry I missed it in the bug, but Ruoyao points to dddafe94823
("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where short-circuiting
the FP comparisons helps on LoongArch.

Not sure if I'm also missing something here, but it kind of feels like
that should be handled by a more generic optimization decision than just
globally "should we short circuit logical ops" -- assuming it really is
the FP comparisons that are causing the cost, as opposed to the actual
logical ops themselves.

Probably best to actually run the benchmarks, though...

> ---
>  gcc/config/riscv/riscv.h | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
> index ead97867eb8..a0ccd1fc762 100644
> --- a/gcc/config/riscv/riscv.h
> +++ b/gcc/config/riscv/riscv.h
> @@ -939,8 +939,6 @@ extern enum riscv_cc get_riscv_cc (const rtx use);
>  #define TARGET_VECTOR_MISALIGN_SUPPORTED \
>    riscv_vector_unaligned_access_p
>
> -#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
> -
>  /* Control the assembler format that we output. */
>
>  /* Output to assembler file text saying following lines

On Thu, 2024-09-05 at 11:59 -0700, Palmer Dabbelt wrote:
> On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
> > We have cheap logical ops, so let's just move this back to the default
> > to take advantage of the standard branch/op heuristics.
> >
> > gcc/ChangeLog:
> >
> > 	PR target/116615
> > 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> > ---
> > There's a bunch more discussion in the bug, but it's starting to smell
> > like this was just a holdover from MIPS (where maybe it also shouldn't
> > be set). I haven't tested this, but I figured I'd send the patch to get
> > a little more visibility.
> >
> > I guess we should also kick off something like a SPEC run to make sure
> > there's no regressions?
>
> Sorry I missed it in the bug, but Ruoyao points to dddafe94823
> ("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where
> short-circuiting the FP comparisons helps on LoongArch.
>
> Not sure if I'm also missing something here, but it kind of feels like
> that should be handled by a more generic optimization decision than just
> globally "should we short circuit logical ops" -- assuming it really is
> the FP comparisons that are causing the cost, as opposed to the actual
> logical ops themselves.

IIUC there are some contributing factors here:

1. On LoongArch FP comparison is slow (costing 5 cycles).
2. On LoongArch the FP comparison result is stored into FCC registers,
   and to do logical operations on two comparison results they need to
   be moved into GPRs first. The move costs one or two cycles (depending
   on the uarch).

and maybe

3. The FP comparison results in the SPEC tests are somewhat predictable.
   IIRC when I tested dddafe94823 I made a test program where the FP
   comparison results are "randomized" (so the branch predictor is
   defeated), then the branch-less code generated with
   -Ofast --param logical-op-non-short-circuit=1 was actually faster
   than the code generated with
   -Ofast --param logical-op-non-short-circuit=0.

AFAIK 2 isn't an issue for RISC-V (where the FP comparison result is just
in a GPR) but 1 and 3 may still need to be considered.

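For illustration only (this is not the test program mentioned in point 3, which was not posted in the thread): a minimal C sketch of the kind of micro-benchmark kernel that message describes. The function names, the xorshift generator, and the threshold parameter are all assumptions made for this sketch. It shows the same predicate written once with short-circuit `&&` and once with both comparisons evaluated and combined with `&`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: xorshift64 pseudo-random generator.
   The state must be non-zero. */
static uint64_t next_random(uint64_t *state)
{
  uint64_t x = *state;
  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  *state = x;
  return x;
}

/* Fill v[0..n) with values in [0, 1) so that comparisons against a
   threshold such as 0.5 have no pattern a branch predictor can learn. */
void fill_random(double *v, size_t n, uint64_t seed)
{
  uint64_t state = seed ? seed : 1;
  for (size_t i = 0; i < n; i++)
    /* Keep the top 53 bits and scale by 2^-53. */
    v[i] = (double)(next_random(&state) >> 11) / 9007199254740992.0;
}

/* Short-circuit form: the second FP comparison runs only if the first
   one succeeds, which normally means one branch per comparison. */
size_t count_short_circuit(const double *a, const double *b, size_t n,
                           double t)
{
  size_t hits = 0;
  for (size_t i = 0; i < n; i++)
    if (a[i] < t && b[i] < t)
      hits++;
  return hits;
}

/* Non-short-circuit form: both comparisons are always evaluated and
   their 0/1 results are combined with a bitwise AND. */
size_t count_combined(const double *a, const double *b, size_t n,
                      double t)
{
  size_t hits = 0;
  for (size_t i = 0; i < n; i++)
    hits += (size_t)((a[i] < t) & (b[i] < t));
  return hits;
}
```

With inputs from `fill_random` and a threshold near 0.5, each comparison is true about half the time in no fixed order, which corresponds to the "randomized" case. Because the compiler is free to rewrite either loop, a measurement in the spirit of the message would more likely build `count_short_circuit` twice, once with `--param logical-op-non-short-circuit=0` and once with `=1`, and time both builds on real hardware.
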
On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> We have cheap logical ops, so let's just move this back to the default
> to take advantage of the standard branch/op heuristics.
>
> gcc/ChangeLog:
>
> 	PR target/116615
> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> ---
> There's a bunch more discussion in the bug, but it's starting to smell
> like this was just a holdover from MIPS (where maybe it also shouldn't
> be set). I haven't tested this, but I figured I'd send the patch to get
> a little more visibility.
>
> I guess we should also kick off something like a SPEC run to make sure
> there's no regressions?

Yea, I'd definitely want to see some hard data on an implementation for
this. I wouldn't want to rely just on icounts and eyeballing given it's
dependent on branch predictor accuracy and such. BPI is probably the
best platform for this kind of testing right now.

I probably can't spin it this week, but probably could next week.

jeff

On Thu, Sep 5, 2024 at 2:52 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> > We have cheap logical ops, so let's just move this back to the default
> > to take advantage of the standard branch/op heuristics.
> >
> > gcc/ChangeLog:
> >
> > 	PR target/116615
> > 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> > ---
> > There's a bunch more discussion in the bug, but it's starting to smell
> > like this was just a holdover from MIPS (where maybe it also shouldn't
> > be set). I haven't tested this, but I figured I'd send the patch to get
> > a little more visibility.
> >
> > I guess we should also kick off something like a SPEC run to make sure
> > there's no regressions?
> Yea, I'd definitely want to see some hard data on an implementation for
> this. I wouldn't want to rely just on icounts and eyeballing given it's
> dependent on branch predictor accuracy and such. BPI is probably the
> best platform for this kind of testing right now.
>
> I probably can't spin it this week, but probably could next week.

Thanks. If you don't mind, please also collect the static code-size
statistics so we can decide if we need to choose different strategies
when optimizing for size vs. speed.

>
> jeff
>

On 9/5/24 12:59 PM, Palmer Dabbelt wrote:
> On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
>> We have cheap logical ops, so let's just move this back to the default
>> to take advantage of the standard branch/op heuristics.
>>
>> gcc/ChangeLog:
>>
>> 	PR target/116615
>> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
>> ---
>> There's a bunch more discussion in the bug, but it's starting to smell
>> like this was just a holdover from MIPS (where maybe it also shouldn't
>> be set). I haven't tested this, but I figured I'd send the patch to get
>> a little more visibility.
>>
>> I guess we should also kick off something like a SPEC run to make sure
>> there's no regressions?
>
> Sorry I missed it in the bug, but Ruoyao points to dddafe94823
> ("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where short-
> circuiting the FP comparisons helps on LoongArch.
>
> Not sure if I'm also missing something here, but it kind of feels like
> that should be handled by a more generic optimization decision than just
> globally "should we short circuit logical ops" -- assuming it really is
> the FP comparisons that are causing the cost, as opposed to the actual
> logical ops themselves.
>
> Probably best to actually run the benchmarks, though...

The #define essentially is overriding the generic heuristics which look
at branch cost to determine how aggressively to try and combine several
conditional branch conditions using logical ops so they can use a single
conditional branch in the end.

I don't remember all the history here, but in retrospect, the mere
existence of that #define points to a failing in the costing models.

FWIW, my general sense is that the gimple phases shouldn't work *too*
hard to try and combine logical ops, but the if-converters in the RTL
phases should be fairly aggressive. The fact that we use BRANCH_COST
to drive both is likely sub-optimal.

jeff

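For illustration only (not from the thread): a minimal C sketch of the source-level difference between the two shapes being discussed, that is, one branch per condition versus conditions combined with a logical op and tested once. The function names are hypothetical, and which shape GCC actually emits for either function depends on the target, the branch cost, the optimization level, and the GCC version. The sketch only shows that the two forms compute the same result.

```c
#include <stdbool.h>

/* One conditional branch per condition: the second comparison is
   skipped whenever the first one already decides the result. */
bool in_range_branchy(int x, int lo, int hi)
{
  if (x < lo)
    return false;
  if (x > hi)
    return false;
  return true;
}

/* Both comparisons are materialized as 0/1 values and combined with a
   bitwise AND, leaving at most one conditional branch for the caller.
   This is only valid because neither comparison has side effects or
   can trap, so evaluating both unconditionally is safe. */
bool in_range_combined(int x, int lo, int hi)
{
  return (x >= lo) & (x <= hi);
}
```

The combined form trades a possibly mispredicted branch for one extra comparison and one AND, which is why its profitability depends on how cheap those operations are relative to a branch on a given implementation.
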
On Thu, Sep 5, 2024 at 2:57 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 9/5/24 12:59 PM, Palmer Dabbelt wrote:
> > On Thu, 05 Sep 2024 11:52:57 PDT (-0700), Palmer Dabbelt wrote:
> >> We have cheap logical ops, so let's just move this back to the default
> >> to take advantage of the standard branch/op heuristics.
> >>
> >> gcc/ChangeLog:
> >>
> >> 	PR target/116615
> >> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> >> ---
> >> There's a bunch more discussion in the bug, but it's starting to smell
> >> like this was just a holdover from MIPS (where maybe it also shouldn't
> >> be set). I haven't tested this, but I figured I'd send the patch to get
> >> a little more visibility.
> >>
> >> I guess we should also kick off something like a SPEC run to make sure
> >> there's no regressions?
> >
> > Sorry I missed it in the bug, but Ruoyao points to dddafe94823
> > ("LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT") where short-
> > circuiting the FP comparisons helps on LoongArch.
> >
> > Not sure if I'm also missing something here, but it kind of feels like
> > that should be handled by a more generic optimization decision than just
> > globally "should we short circuit logical ops" -- assuming it really is
> > the FP comparisons that are causing the cost, as opposed to the actual
> > logical ops themselves.
> >
> > Probably best to actually run the benchmarks, though...
> The #define essentially is overriding the generic heuristics which look
> at branch cost to determine how aggressively to try and combine several
> conditional branch conditions using logical ops so they can use a single
> conditional branch in the end.
>
> I don't remember all the history here, but in retrospect, the mere
> existence of that #define points to a failing in the costing models.

I provided the original history of LOGICAL_OP_NON_SHORT_CIRCUIT in the
RISC-V bug report. And yes, there is a cost model failure here.

LOGICAL_OP_NON_SHORT_CIRCUIT was useful if you have a decent cset (or
these days have a ccmp optab). One cost model issue is that
LOGICAL_OP_NON_SHORT_CIRCUIT does not distinguish whether the comparison
is FP or integer (distinguishing them would handle Loongson and MIPS,
and to a lesser extent RISC-V). The PowerPC backend does not implement
the ccmp optab, nor does it have a decently costed cset, so having it as
0 is correct, even though the branch cost might be low for the target
(it could implement the ccmp optab now, but nobody has done that yet).

Note RISC-V's cset is cheap (both size and speed) due to being close to
MIPS and just having instructions which set the GPRs and then comparing
against 0.

I don't have time until next year to start looking at improving the
situation with respect to LOGICAL_OP_NON_SHORT_CIRCUIT/BRANCH_COST; it
is on my radar since I want to improve how aarch64's ccmp is done and
remove the use of LOGICAL_OP_NON_SHORT_CIRCUIT from fold-const so that
it is only used in the ifcombine (or maybe even just in isel) pass.

Thanks,
Andrew Pinski

>
> FWIW, my general sense is that the gimple phases shouldn't work *too*
> hard to try and combine logical ops, but the if-converters in the RTL
> phases should be fairly aggressive. The fact that we use BRANCH_COST
> to drive both is likely sub-optimal.
> jeff

On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> We have cheap logical ops, so let's just move this back to the default
> to take advantage of the standard branch/op heuristics.
>
> gcc/ChangeLog:
>
> 	PR target/116615
> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.

So on the BPI this is a pretty clear win. Not surprisingly perlbench
and gcc are the big winners. It somewhat surprisingly regresses x264,
deepsjeng & leela, but the magnitudes are smaller. The net from a cycle
perspective is 2.4%. Every benchmark looks better from a branch count
perspective.

So in my mind it's just a matter of fixing any testsuite fallout (I
would expect some) and this is OK.

jeff

On Wed, Oct 2, 2024 at 5:56 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> > We have cheap logical ops, so let's just move this back to the default
> > to take advantage of the standard branch/op heuristics.
> >
> > gcc/ChangeLog:
> >
> > 	PR target/116615
> > 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> So on the BPI this is a pretty clear win. Not surprisingly perlbench
> and gcc are the big winners. It somewhat surprisingly regresses x264,
> deepsjeng & leela, but the magnitudes are smaller. The net from a cycle
> perspective is 2.4%. Every benchmark looks better from a branch count
> perspective.
>
> So in my mind it's just a matter of fixing any testsuite fallout (I
> would expect some) and this is OK.

Jeff, were you able to measure the change in static code size, too?
These results are very encouraging, but I'd like to make sure we don't
need to retain the current behavior when optimizing for size.

>
> jeff
>

On 10/2/24 4:39 PM, Andrew Waterman wrote:
> On Wed, Oct 2, 2024 at 5:56 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
>>> We have cheap logical ops, so let's just move this back to the default
>>> to take advantage of the standard branch/op heuristics.
>>>
>>> gcc/ChangeLog:
>>>
>>> 	PR target/116615
>>> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
>> So on the BPI this is a pretty clear win. Not surprisingly perlbench
>> and gcc are the big winners. It somewhat surprisingly regresses x264,
>> deepsjeng & leela, but the magnitudes are smaller. The net from a cycle
>> perspective is 2.4%. Every benchmark looks better from a branch count
>> perspective.
>>
>> So in my mind it's just a matter of fixing any testsuite fallout (I
>> would expect some) and this is OK.
>
> Jeff, were you able to measure the change in static code size, too?
> These results are very encouraging, but I'd like to make sure we don't
> need to retain the current behavior when optimizing for size.

Codesize is ever so slightly worse. As in less than 0.1%. Not worth it
in my mind to do something different in that range.

Jeff

On Wed, Oct 2, 2024 at 4:41 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> On 10/2/24 4:39 PM, Andrew Waterman wrote:
> > On Wed, Oct 2, 2024 at 5:56 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
> >>
> >> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> >>> We have cheap logical ops, so let's just move this back to the default
> >>> to take advantage of the standard branch/op heuristics.
> >>>
> >>> gcc/ChangeLog:
> >>>
> >>> 	PR target/116615
> >>> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> >> So on the BPI this is a pretty clear win. Not surprisingly perlbench
> >> and gcc are the big winners. It somewhat surprisingly regresses x264,
> >> deepsjeng & leela, but the magnitudes are smaller. The net from a cycle
> >> perspective is 2.4%. Every benchmark looks better from a branch count
> >> perspective.
> >>
> >> So in my mind it's just a matter of fixing any testsuite fallout (I
> >> would expect some) and this is OK.
> >
> > Jeff, were you able to measure the change in static code size, too?
> > These results are very encouraging, but I'd like to make sure we don't
> > need to retain the current behavior when optimizing for size.
> Codesize is ever so slightly worse. As in less than .1%. Not worth it
> in my mind to do something different in that range.

Thanks. Agreed.

>
> Jeff

On Thu, Oct 3, 2024 at 3:15 AM Andrew Waterman <andrew@sifive.com> wrote:
>
> On Wed, Oct 2, 2024 at 4:41 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
> >
> > On 10/2/24 4:39 PM, Andrew Waterman wrote:
> > > On Wed, Oct 2, 2024 at 5:56 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
> > >>
> > >> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> > >>> We have cheap logical ops, so let's just move this back to the default
> > >>> to take advantage of the standard branch/op heuristics.
> > >>>
> > >>> gcc/ChangeLog:
> > >>>
> > >>> 	PR target/116615
> > >>> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> > >> So on the BPI this is a pretty clear win. Not surprisingly perlbench
> > >> and gcc are the big winners. It somewhat surprisingly regresses x264,
> > >> deepsjeng & leela, but the magnitudes are smaller. The net from a cycle
> > >> perspective is 2.4%. Every benchmark looks better from a branch count
> > >> perspective.
> > >>
> > >> So in my mind it's just a matter of fixing any testsuite fallout (I
> > >> would expect some) and this is OK.
> > >
> > > Jeff, were you able to measure the change in static code size, too?
> > > These results are very encouraging, but I'd like to make sure we don't
> > > need to retain the current behavior when optimizing for size.
> > Codesize is ever so slightly worse. As in less than .1%. Not worth it
> > in my mind to do something different in that range.

It probably helps code-size when not optimizing for size depending on
how you align jumps.

Richard.

> Thanks. Agreed.
>
> >
> > Jeff

On 10/4/24 12:42 AM, Richard Biener wrote:
> On Thu, Oct 3, 2024 at 3:15 AM Andrew Waterman <andrew@sifive.com> wrote:
>>
>> On Wed, Oct 2, 2024 at 4:41 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>>
>>> On 10/2/24 4:39 PM, Andrew Waterman wrote:
>>>> On Wed, Oct 2, 2024 at 5:56 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>>>>
>>>>> On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
>>>>>> We have cheap logical ops, so let's just move this back to the default
>>>>>> to take advantage of the standard branch/op heuristics.
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>> 	PR target/116615
>>>>>> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
>>>>> So on the BPI this is a pretty clear win. Not surprisingly perlbench
>>>>> and gcc are the big winners. It somewhat surprisingly regresses x264,
>>>>> deepsjeng & leela, but the magnitudes are smaller. The net from a cycle
>>>>> perspective is 2.4%. Every benchmark looks better from a branch count
>>>>> perspective.
>>>>>
>>>>> So in my mind it's just a matter of fixing any testsuite fallout (I
>>>>> would expect some) and this is OK.
>>>>
>>>> Jeff, were you able to measure the change in static code size, too?
>>>> These results are very encouraging, but I'd like to make sure we don't
>>>> need to retain the current behavior when optimizing for size.
>>> Codesize is ever so slightly worse. As in less than .1%. Not worth it
>>> in my mind to do something different in that range.
>
> It probably helps code-size when not optimizing for size depending on
> how you align jumps.

By default we aren't aligning jumps at all. The infrastructure is in
place to allow uarchs to select their preferences though (we're using
that infrastructure internally).

jeff

On 9/5/24 12:52 PM, Palmer Dabbelt wrote:
> We have cheap logical ops, so let's just move this back to the default
> to take advantage of the standard branch/op heuristics.
>
> gcc/ChangeLog:
>
> 	PR target/116615
> 	* config/riscv/riscv.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Remove.
> ---
> There's a bunch more discussion in the bug, but it's starting to smell
> like this was just a holdover from MIPS (where maybe it also shouldn't
> be set). I haven't tested this, but I figured I'd send the patch to get
> a little more visibility.
>
> I guess we should also kick off something like a SPEC run to make sure
> there's no regressions?

So as I noted earlier, this appears to be a nice win on the BPI.
Testsuite fallout is minimal -- just the one SFB-related test tripping
at -Os that was also hit by Andrew P's work.

After looking at it more closely, the SFB codegen and the codegen after
Andrew's work should be equivalent assuming two independent ops can
dispatch together.

The test actually generates sensible code at -Os. It's the -Os in
combination with the -fno-ssa-phiopt that causes problems. I think the
best thing to do here is just skip at -Os. That still keeps a degree of
testing the SFB path.

Tested successfully in my tester. But will wait for the pre-commit
tester to render a verdict before moving forward.

Jeff

diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index 3aecb43f831..53b7b2a40ed 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -939,8 +939,6 @@ extern enum riscv_cc get_riscv_cc (const rtx use);
 #define TARGET_VECTOR_MISALIGN_SUPPORTED \
   riscv_vector_unaligned_access_p

-#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
-
 /* Control the assembler format that we output. */

 /* Output to assembler file text saying following lines
diff --git a/gcc/testsuite/gcc.target/riscv/cset-sext-sfb.c b/gcc/testsuite/gcc.target/riscv/cset-sext-sfb.c
index 6e9f8cc61de..1ee45b33e15 100644
--- a/gcc/testsuite/gcc.target/riscv/cset-sext-sfb.c
+++ b/gcc/testsuite/gcc.target/riscv/cset-sext-sfb.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O1" } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O1" "-Os" } } */
 /* { dg-options "-march=rv32gc -mtune=sifive-7-series -mbranch-cost=1 -fno-ssa-phiopt -fdump-rtl-ce1" { target { rv32 } } } */
 /* { dg-options "-march=rv64gc -mtune=sifive-7-series -mbranch-cost=1 -fno-ssa-phiopt -fdump-rtl-ce1" { target { rv64 } } } */

diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index ead97867eb8..a0ccd1fc762 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -939,8 +939,6 @@ extern enum riscv_cc get_riscv_cc (const rtx use);
 #define TARGET_VECTOR_MISALIGN_SUPPORTED \
   riscv_vector_unaligned_access_p

-#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
-
 /* Control the assembler format that we output. */

 /* Output to assembler file text saying following lines