Message ID | 202007060111.0661BHHM032190@ignucius.se.axis.com |
---|---|
State | New |
Series | RFA: Fix combine.c combining a move and a non-move into two non-moves, PR93372 |
Hi!

On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> TL;DR: fixing a misdetection of what is a "simple move".

That is not a very correct characterisation of what this does :-)

> Looking into performance degradation after de-cc0 for CRIS, I
> noticed combine behaving badly; it changed a move and a
> right-shift into two right-shifts, where the "combined" move was
> not eliminated in later passes, and where the deficiency caused
> an extra insn in a hot loop: crcu16 (and crcu32) in coremark.
>
> Before de-cc0, the insns input to combine looked like:
>    33: r58:SI=r56:SI 0>>r48:SI
>       REG_DEAD r56:SI
>    35: r37:HI=r58:SI#0
> and after:
>    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
>       REG_DEAD r56:SI
>       REG_UNUSED dccr:CC
>    35: {r37:HI=r58:SI#0;clobber dccr:CC;}
>       REG_UNUSED dccr:CC

So a shift like this is at most as expensive as a move, on your target
(or, in your backend, anyway ;-) )

> That is, there's always a parallel with a clobber of the
> condition-codes register.  Being a parallel, it's not an
> is_just_move, but e.g. a single_set.
>
> For the de-cc0:ed "combination", it ended up as
>    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
>       REG_UNUSED dccr:CC
>    35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
>       REG_DEAD r56:SI
>       REG_UNUSED dccr:CC
> That is, a move and a shift turned into two shifts; the extra
> shift is not eliminated by later passes, while the move was
> (with cc0, and "will be again"), leading to redundant
> instructions.

Which was the whole point of the is_just_move() thing, yes.  Combine
doesn't know most moves will be eliminated by RA (but many are useful
to have before RA, because it gives RA much more freedom).  If a move
is the same cost as a simple insn, doing two (say shift, like here)
insns in parallel is cheaper on most machines than having a shift and
a move sequentially.  (2-2 combinations are helpful on single-scalar
and even in-order machines as well, btw).

> At first I thought this was due to parallel-ignorant old code,
> but the "guilty" change is actually pretty recent.  Regarding a
> parallel with a clobber not being "just" a move, there's only
> the two adjacent callers seen in the patch (obviously with the
> rename), and their use exactly matches to check that the
> argument is a single_set which is a move.  It's always applied
> to an rtx_insn, so I changed the type and name to avoid the
> "just" issue.  I had to adjust the type when calling single_set.

But it is *not* supposed to be the same as single_set.

> I checked the original commit, c4c5ad1d6d1e1e a.k.a. r263067, and
> it seems parallels-as-sets were just overlooked and that this

They were not.  It causes regressions.  That is why it has a different
name, not something with "single set".  "just_move" isn't a very good
name, I couldn't come up with something better, it is a pretty
complicated concept :-/  "i2_should_not_be_2_2_combined"?

I'll rerun some testing to show this.  It'll take a while.

> patch appears to agree with the intent and the comment at the
> use of i2_was_move and i3_was_move, which has a clause saying
> "Also do this if we started with two insns neither of which was
> a simple move".

That says that we combine to 2 insns also if we started with only 2,
but neither of those was just a move.

> With this correction in place, the performance degradation
> related to de-cc0 of the CRIS port as measured by coremark is
> gone and turned into a small win.
> N.B.: there certainly is a
> code difference in other hot functions, and the swing between
> different functions is larger than this difference; to be dealt
> with separately.
>
> Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
> aarch64-linux (unexpectedly slow).
>
> Ok to commit?

No, sorry.

One thing that could work is allowing (unused) clobbers of fixed
registers (like you have here), or maybe some hook is needed to say
this register is like a flags register, or similar.  That should work
for you, and not regress other targets, maybe even help a little?
We'll see.


Thanks,
Segher
Hi!

On Tue, Jul 07, 2020 at 02:50:09AM +0200, Hans-Peter Nilsson wrote:
> > On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> > > TL;DR: fixing a misdetection of what is a "simple move".
> >
> > That is not a very correct characterisation of what this does :-)
>
> That's apparently where we completely disagree.  :-)

Well, I wrote that code, I know what is considered "just a move" there.
You want to extend that, and that is fine, but this is not a bug.
Taken to the extreme, anything in GCC is completely buggy, because it
doesn't solve world hunger (yet!), following that line of thought.

> > > Looking into performance degradation after de-cc0 for CRIS, I
> > > noticed combine behaving badly; it changed a move and a
> > > right-shift into two right-shifts, where the "combined" move was
> > > not eliminated in later passes, and where the deficiency caused
> > > an extra insn in a hot loop: crcu16 (and crcu32) in coremark.
> > >
> > > Before de-cc0, the insns input to combine looked like:
> > >    33: r58:SI=r56:SI 0>>r48:SI
> > >       REG_DEAD r56:SI
> > >    35: r37:HI=r58:SI#0
> > > and after:
> > >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_DEAD r56:SI
> > >       REG_UNUSED dccr:CC
> > >    35: {r37:HI=r58:SI#0;clobber dccr:CC;}
> > >       REG_UNUSED dccr:CC
> >
> > So a shift like this is at most as expensive as a move, on your target
> > (or, in your backend, anyway ;-) )
>
> On CRIS, the backend *and* the target, yes; one cycle, one short
> instruction.

So combine did what it is supposed to do.

> > > That is, there's always a parallel with a clobber of the
> > > condition-codes register.  Being a parallel, it's not an
> > > is_just_move, but e.g. a single_set.

This is something that happens on many targets.  For some, only for
some instructions (and flag registers).  For some, for many
instructions; and for really unhappy targets, for almost all
instructions, even for register moves and/or loads and/or stores.

> > > For the de-cc0:ed "combination", it ended up as
> > >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_UNUSED dccr:CC
> > >    35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_DEAD r56:SI
> > >       REG_UNUSED dccr:CC
> > > That is, a move and a shift turned into two shifts; the extra
> > > shift is not eliminated by later passes, while the move was
> > > (with cc0, and "will be again"), leading to redundant
> > > instructions.
> >
> > Which was the whole point of the is_just_move() thing, yes.  Combine
> > doesn't know most moves will be eliminated by RA (but many are useful
> > to have before RA, because it gives RA much more freedom).  If a move
> > is the same cost as a simple insn, doing two (say shift, like here)
> > insns in parallel is cheaper on most machines than having a shift and
> > a move sequentially.
>
> Most parallel machines you mean,

No, I mean most machines, not just super-scalar machines.

> but why bring them up when
> there's no means for combine to tell the difference?

I did not bring them up, and this optimisation is important not only
for super-scalar targets.  But the most important targets for GCC (in
terms of how many people use it; all targets are important for many
other reasons) are all super-scalar.

> Here, the
> end result is that it *added* an instruction to the hot loop.

It did not.  Combine *never* does that.

> It's a deficiency and it caused a performance regression, can't
> argue with that.

It did not cause any regression.  It can be improved, sure, and thank
you for pointing that out!
> > (2-2 combinations are helpful on single-scalar and even
> > in-order machines as well, btw).
>
> I certainly don't contest that the move can be eliminated, and
> that most cost-effective 2-2 eliminations are helpful.  (See my
> other post about combine being eager with allowing same-cost
> combinations.)

I did not see that post, do you have a pointer?

> > > At first I thought this was due to parallel-ignorant old code,
> > > but the "guilty" change is actually pretty recent.  Regarding a
> > > parallel with a clobber not being "just" a move, there's only
> > > the two adjacent callers seen in the patch (obviously with the
> > > rename), and their use exactly matches to check that the
> > > argument is a single_set which is a move.  It's always applied
> > > to an rtx_insn, so I changed the type and name to avoid the
> > > "just" issue.  I had to adjust the type when calling single_set.
> >
> > But it is *not* supposed to be the same as single_set.
>
> (I'm not saying that a single_set is a move.  But that's
> obvious.)  I guess you meant to say that a single_set with a
> general_operand as a source (as in the patch) is not supposed to
> be the same as a move.  This is the only place in combine where
> that distinction would be important.

Why?

It used to be that is_just_move was called on new*pat as well, so you
simply *could not* use single_set there (you have no rtx_insn*).
single_set also allows other insns, for example, multiple sets!  But
testing shows that this does not actually happen often at all, so we
could use single_set here and not change much at all for most other
targets.

> I think you're just overconscious about clobbers here.

It has nothing to do with that.

> > > I checked the original commit, c4c5ad1d6d1e1e a.k.a. r263067, and
> > > it seems parallels-as-sets were just overlooked and that this
> >
> > They were not.  It causes regressions.
>
> Do you have some pointers to PRs or something else backing that
> statement, or is it your work-in-progress hinted below?

I do not know what "work in progress" you mean?

> > That is why it has a different
> > name, not something with "single set".  "just_move" isn't a very good
> > name, I couldn't come up with something better, it is a pretty
> > complicated concept :-/
> >
> > "i2_should_not_be_2_2_combined"?
> >
> > I'll rerun some testing to show this.  It'll take a while.
>
> Care to fill in some details on what kind of testing?

Same as always: I build the toolchain and Linux kernel for 30 different
targets, and compare the output.  For most things, just looking at the
size of the generated binary is telling (it shows some combinations did
or did not happen).  For a more detailed view, you need to look at the
actual machine code diffs directly, which is much more work (sometimes
a diff of size per function is useful to see first as well).

For 2-2, size does *not* usually change, which brings us immediately
into "a lot more work" territory.  Oh, and all x86 compilers ICE.

> > > With this correction in place, the performance degradation
> > > related to de-cc0 of the CRIS port as measured by coremark is
> > > gone and turned into a small win.  N.B.: there certainly is a
> > > code difference in other hot functions, and the swing between
> > > different functions is larger than this difference; to be dealt
> > > with separately.
> > >
> > > Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
> > > aarch64-linux (unexpectedly slow).
> > >
> > > Ok to commit?
> >
> > No, sorry.
>
> Sigh.
> I'm very interested in what your investigation will show.

It shows we can change to use single_set here.

> > One thing that could work is allowing (unused) clobbers of fixed
> > registers (like you have here), or maybe some hook is needed to say
> > this register is like a flags register, or similar.  That should work
> > for you, and not regress other targets, maybe even help a little?
> > We'll see.
>
> Still, there is already TARGET_FLAGS_REGNUM (a "hooked"
> constant), so I take it you would be happy if we recognize a
> clobber of *just that*, in a parallel?  (I'll take care of
> updating tm.texi of course.)

Most targets do not use cmpelim, and many *can not* define
TARGET_FLAGS_REGNUM, unfortunately.

I'll review the original patch again, to point out where it still needs
changing.


Thanks,
Segher
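For concreteness, a minimal sketch of the check discussed above -- treating
an insn as "just a move" even when its pattern is a PARALLEL of the SET plus
a clobber of a fixed hard register -- could look like the following.  The
function name is hypothetical, this is not the change that was eventually
committed, and the choice between testing fixed_regs and comparing against
targetm.flags_regnum is exactly the open question in this exchange.

-- 8< --
/* Hypothetical variant of is_just_move: also accept a pattern that is a
   PARALLEL of one SET plus a CLOBBER of a fixed hard register (such as a
   condition-code register that most insns on the target clobber).  */
static bool
is_just_move_or_flags_clobber (rtx_insn *insn)
{
  rtx x = PATTERN (insn);

  if (GET_CODE (x) == PARALLEL
      && XVECLEN (x, 0) == 2
      && GET_CODE (XVECEXP (x, 0, 1)) == CLOBBER)
    {
      rtx reg = XEXP (XVECEXP (x, 0, 1), 0);

      /* Only look through clobbers of fixed hard registers; a narrower
         variant could instead require REGNO (reg) == targetm.flags_regnum.  */
      if (!REG_P (reg)
          || REGNO (reg) >= FIRST_PSEUDO_REGISTER
          || !fixed_regs[REGNO (reg)])
        return false;

      x = XVECEXP (x, 0, 0);
    }

  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
}
-- 8< --

In the end the thread settles on using single_set instead (which already
looks through clobbers in a PARALLEL): that is what the patch at the bottom
does, and what was committed as 84c5396d4bdb.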
Hi!

On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> TL;DR: fixing a misdetection of what is a "simple move".

As said before, this is not a fix, not a "misdetection"; it is plain
and simple a behaviour change.  "Use single_set for is_just_move" would
be a fine subject.

> Regarding a
> parallel with a clobber not being "just" a move, there's only
> the two adjacent callers seen in the patch (obviously with the
> rename), and their use exactly matches to check that the
> argument is a single_set which is a move.

This isn't true; single_set has somewhat different semantics.  Some of
that is exactly the change you are looking for, but there are more
differences.

> It's always applied
> to an rtx_insn, so I changed the type and name to avoid the
> "just" issue.  I had to adjust the type when calling single_set.

Don't change the name, please.  Changing it to take an rtx_insn* is
fine; we aren't likely to change back to testing the resulting
patterns.

> I checked the original commit, c4c5ad1d6d1e1e a.k.a. r263067 and

The history is years older (some of which is on gcc-patches@).

I'll make a simpler patch.  Thanks!


Segher
On Mon, Jul 13, 2020 at 07:29:00AM +0200, Hans-Peter Nilsson wrote:
> > From: Segher Boessenkool <segher@kernel.crashing.org>
> > Date: Tue, 7 Jul 2020 22:50:43 +0200
>
> > I'll make a simpler patch.  Thanks!
>
> You're welcome.  So, you'll take care of the updated patch
> yourself?

Yes.  Delayed a bit, I had a lot to do for 10.2...  But Soon(tm) :-)


Segher
Hi!

On Mon, Jul 13, 2020 at 07:25:37AM +0200, Hans-Peter Nilsson wrote:
> > > > > TL;DR: fixing a misdetection of what is a "simple move".
> > > >
> > > > That is not a very correct characterisation of what this does :-)
> > >
> > > That's apparently where we completely disagree.  :-)
> >
> > Well, I wrote that code, I know what is considered "just a move" there.
>
> You lost some context: I'm comparing before/after the
> cc0-conversion for CRIS, where this is a misdetection (a false
> negative) of a move and causes a performance regression.

The cc0 conversion caused a performance regression.  You can improve
some code in combine to make that not happen.

> > > I certainly don't contest that the move can be eliminated, and
> > > that most cost-effective 2-2 eliminations are helpful.  (See my
> > > other post about combine being eager with allowing same-cost
> > > combinations.)
> >
> > I did not see that post, do you have a pointer?
>
> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549416.html

I'll reply to that separately.

> > single_set also allows other insns, for example, multiple sets!
>
> ...where the other sets are unused.  Not sure how that kind of
> thing would get combined here, but if its combined cost is
> beneficial, it would be a win, I guess.

The unused outputs are thrown away by combine, except when that doesn't
match; then it is tried again, but with the original clobbers.

> > > Do you have some pointers to PRs or something else backing that
> > > statement, or is it your work-in-progress hinted below?
> >
> > I do not know what "work in progress" you mean?
>
> I'm referring to your "I'll rerun some testing to show this.
> It'll take a while."  Are the regressions you refer to above
> tracked in bugzilla or on some mailing list?

The whole 2-2 combine thing took half a year at least to develop.  I
have posted to gcc-patches@ about it a few times.  Not having the
is_just_move stuff causes highly visible regressions on some targets;
I do not remember which, but it hurts all targets.  2-2 combine with
the is_just_move (and make_extra_copies) stuff was a win everywhere.

The tests just take long to run (it used to be 2h per run, but it is
closer to 3h per run with the current compiler: building
cross-compilers used to be less than 4m per target, and it has been
much worse for a few years now).  Analysing the results is easy for
most kinds of instruction combination: just looking at the binary size
of the testcase (I usually build Linux) gives a good idea how effective
combine was.  But for 2-2 combinations that doesn't show much at all,
so I dig through the actual resulting code (for many targets; not all
30, just those I think are interesting).

> > For 2-2, size does *not* usually change, which brings us immediately
> > into "a lot more work" territory.  Oh, and all x86 compilers ICE.

The x86-64 kernel *does* build, just some boot wrapper code fails, but
the kernel itself does build.  i386 does ICE however, something with
memcpy or some such.

> If combine only did lower-cost combinations (perhaps with
> Richard Sandiford's lower-size-when-tied suggestion), I guess
> this wouldn't happen.  0:-)

And we would regress (a LOT).

> > It shows we can change to use single_set here.
>
> Did you mean "will show whether" or is it already complete?

It did complete, yes (and didn't change a single resulting
instruction).  So that was easy :-)

> > I'll review the original patch again, to point out where it still needs
> > changing.
>
> ...but if you're in progress with a single_set variant, I'm all
> for it.

Yup, it's pretty simple actually :-)


Thanks,
Segher
On Tue, Jul 14, 2020 at 04:33:42PM -0500, Segher Boessenkool wrote:
> > If combine only did lower-cost combinations (perhaps with
> > Richard Sandiford's lower-size-when-tied suggestion), I guess
> > this wouldn't happen.  0:-)
>
> And we would regress (a LOT).

Like this.  C0 is an unmodified compiler.  C1 is with the single_set
modification to is_just_move I committed a minute ago (84c5396d4bdb).
C2 is with this patch:

-- 8< --
diff --git a/gcc/combine.c b/gcc/combine.c
index 3a81bb6..619ba77 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -939,7 +939,7 @@ combine_validate_cost (rtx_insn *i0, rtx_insn *i1, rtx_insn
 
   /* Disallow this combination if both new_cost and old_cost are greater than
      zero, and new_cost is greater than old cost.  */
-  int reject = old_cost > 0 && new_cost > old_cost;
+  int reject = old_cost > 0 && new_cost >= old_cost;
 
   if (dump_file)
     {
-- 8< --

[segher@gcc135 buildall]$ perl sizes.pl --percent C[012]
                   C0        C1        C2
alpha         6045560  100.000%  100.518%
arc           3529016  100.000%   99.933%
arm          14173370  100.000%  101.607%
arm64        12958466  100.000%  100.477%
c6x           2341205  100.000%  100.154%
csky          3320386  100.000%  100.838%
h8300         1163584  100.000%  100.331%
i386                0         0         0
ia64         18079744  100.000%  100.857%
m68k          3711195  100.000%  100.327%
microblaze    4930937  100.000%  100.054%
mips          8403293  100.000%  100.049%
mips64        6975860  100.000%   99.986%
nds32         4450951  100.000%   99.992%
nios2         3641733  100.000%  100.206%
openrisc      4182260  100.000%  100.025%
parisc        7706299  100.000%  101.500%
parisc64      8677365  100.000%  101.491%
powerpc      10016575  100.000%  100.001%
powerpc64    17331518  100.000%   99.974%
powerpc64le  17331518  100.000%   99.974%
riscv32             0         0         0
riscv64             0         0         0
s390         13091897  100.000%  100.396%
sh            3213207  100.000%  100.031%
shnommu       1610444  100.000%  100.031%
sparc         4356641  100.000%  101.521%
sparc64       6745123  100.000%  101.450%
x86_64       19663507  100.000%  100.007%
xtensa        2105610  100.000%  100.425%


Segher
diff --git a/gcc/combine.c b/gcc/combine.c
index 7da144e..ed90b16 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2624,13 +2624,15 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
   return true;
 }
 
-/* Return whether X is just a single set, with the source
-   a general_operand.  */
+/* Return whether INSN is just a single set, with the source
+   a general_operand.  INSN must be an insn, not stripped to its PATTERN.  */
 static bool
-is_just_move (rtx x)
+is_move (const rtx_insn *insn)
 {
-  if (INSN_P (x))
-    x = PATTERN (x);
+  rtx x = single_set (insn);
+
+  if (x == NULL_RTX)
+    return false;
 
   return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
 }
@@ -3103,8 +3105,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
     }
 
   /* Record whether i2 and i3 are trivial moves.  */
-  i2_was_move = is_just_move (i2);
-  i3_was_move = is_just_move (i3);
+  i2_was_move = is_move (i2);
+  i3_was_move = is_move (i3);
 
   /* Record whether I2DEST is used in I2SRC and similarly for the
      other cases.  Knowing this will help in register status updating below.  */