Message ID: HE1PR0801MB205812F95820712B305D51D7831C0@HE1PR0801MB2058.eurprd08.prod.outlook.com
State: New
Series: [AArch64] Use LDP/STP in shrinkwrapping
Hi Wilco,

On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
> An example epilog in a shrinkwrapped function before:
>
>         ldp     x21, x22, [sp,#16]
>         ldr     x23, [sp,#32]
>         ldr     x24, [sp,#40]
>         ldp     x25, x26, [sp,#48]
>         ldr     x27, [sp,#64]
>         ldr     x28, [sp,#72]
>         ldr     x30, [sp,#80]
>         ldr     d8, [sp,#88]
>         ldp     x19, x20, [sp],#96
>         ret

In this example, the compiler already can make an ldp for both x23/x24
and x27/x28 just fine (if not in emit_epilogue_components, then simply
in a peephole); why did that not work?  Or is this not the actual
generated machine code (and there are labels between the insns, for
example)?


Segher
Segher Boessenkool wrote:
> On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
>> An example epilog in a shrinkwrapped function before:
>>
>>         ldp     x21, x22, [sp,#16]
>>         ldr     x23, [sp,#32]
>>         ldr     x24, [sp,#40]
>>         ldp     x25, x26, [sp,#48]
>>         ldr     x27, [sp,#64]
>>         ldr     x28, [sp,#72]
>>         ldr     x30, [sp,#80]
>>         ldr     d8, [sp,#88]
>>         ldp     x19, x20, [sp],#96
>>         ret
>
> In this example, the compiler already can make an ldp for both x23/x24
> and x27/x28 just fine (if not in emit_epilogue_components, then simply
> in a peephole); why did that not work?  Or is this not the actual
> generated machine code (and there are labels between the insns, for
> example)?

This block originally had a label in it: 2 blocks emitted identical
restores and then branched to the final epilog.  The final epilog was
then duplicated, so we end up with 2 almost identical epilogs of 10
instructions (almost, since there were 1-2 unrelated instructions in
both blocks).

Peepholing is very conservative about instructions using SP and won't
touch anything frame related.  If this worked better then the backend
could just emit single loads/stores and let peepholing generate
LDP/STP.

However this is not the real issue.  In the worst case the current code
may emit only LDR and STR.  If there are multiple callee-saves in a
block, we want to use LDP/STP, and if there is an odd number of
registers, we want to add a callee-save from an inner block.

Another issue is that after the pro_and_epilogue pass I see multiple
restores of the same registers and then a branch to the same block.  We
should try to avoid the unnecessary duplication.

Wilco
On Mon, Jan 08, 2018 at 01:27:24PM +0000, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> > On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
> >> An example epilog in a shrinkwrapped function before: [...]
> >
> > In this example, the compiler already can make an ldp for both
> > x23/x24 and x27/x28 just fine (if not in emit_epilogue_components,
> > then simply in a peephole); why did that not work?  Or is this not
> > the actual generated machine code (and there are labels between the
> > insns, for example)?
>
> This block originally had a label in it: 2 blocks emitted identical
> restores and then branched to the final epilog.  The final epilog was
> then duplicated, so we end up with 2 almost identical epilogs of 10
> instructions (almost, since there were 1-2 unrelated instructions in
> both blocks).
>
> Peepholing is very conservative about instructions using SP and won't
> touch anything frame related.  If this worked better then the backend
> could just emit single loads/stores and let peepholing generate
> LDP/STP.

How unfortunate; that should definitely be improved then.  Always
pairing two registers together *also* degrades code quality.

> Another issue is that after the pro_and_epilogue pass I see multiple
> restores of the same registers and then a branch to the same block.
> We should try to avoid the unnecessary duplication.

It already does that if *all* predecessors of that block do that.  If
you want to do it in other cases, you end up with more jumps.  That may
be beneficial in some cases, of course, but it is not an obvious win
(and in the general case it is, hrm, let's use nice words, "terrible").


Segher
Segher Boessenkool wrote:
> On Mon, Jan 08, 2018 at 01:27:24PM +0000, Wilco Dijkstra wrote:
>
>> Peepholing is very conservative about instructions using SP and won't
>> touch anything frame related.  If this worked better then the backend
>> could just emit single loads/stores and let peepholing generate
>> LDP/STP.
>
> How unfortunate; that should definitely be improved then.

Improving that would indeed help, but it won't fix the issue.  The
epilog may not always be duplicated and merged like in my example.  Any
subset of the saves and restores may not be pairable, so the worst case
still has twice as many memory accesses.

> Always pairing two registers together *also* degrades code quality.

No, while it's not optimal, it means smaller code and fewer memory
accesses.

>> Another issue is that after the pro_and_epilogue pass I see multiple
>> restores of the same registers and then a branch to the same block.
>> We should try to avoid the unnecessary duplication.
>
> It already does that if *all* predecessors of that block do that.  If
> you want to do it in other cases, you end up with more jumps.  That
> may be beneficial in some cases, of course, but it is not an obvious
> win (and in the general case it is, hrm, let's use nice words,
> "terrible").

That may well be the problem.  So if there are N predecessors, of which
N-1 need to restore the same set of callee-saves, but one was
shrinkwrapped, N-1 copies of the same restores might be emitted.  N
could be the number of blocks in a function - I really hope it doesn't
work out like that...

Wilco
On Mon, Jan 08, 2018 at 08:25:47PM +0000, Wilco Dijkstra wrote:
> > Always pairing two registers together *also* degrades code quality.
>
> No, while it's not optimal, it means smaller code and fewer memory
> accesses.

It means you execute *more* memory accesses.  Always.  This may be
sometimes hidden, sure.  I'm not saying you do not want more ldp's;
I'm saying this particular strategy is very far from ideal.

> >> Another issue is that after the pro_and_epilogue pass I see
> >> multiple restores of the same registers and then a branch to the
> >> same block.  We should try to avoid the unnecessary duplication.
> >
> > It already does that if *all* predecessors of that block do that.
> > If you want to do it in other cases, you end up with more jumps.
> > That may be beneficial in some cases, of course, but it is not an
> > obvious win (and in the general case it is, hrm, let's use nice
> > words, "terrible").
>
> That may well be the problem.  So if there are N predecessors, of
> which N-1 need to restore the same set of callee-saves, but one was
> shrinkwrapped, N-1 copies of the same restores might be emitted.  N
> could be the number of blocks in a function - I really hope it doesn't
> work out like that...

In the worst case it would.  OTOH, joining every combo into blocks
costs O(2**C) (where C is the # components) bb's worst case.

It isn't a simple problem.  The current tuning works pretty well for
us, but no doubt it can be improved!


Segher
Segher Boessenkool wrote:
> On Mon, Jan 08, 2018 at 08:25:47PM +0000, Wilco Dijkstra wrote:
>> > Always pairing two registers together *also* degrades code quality.
>>
>> No, while it's not optimal, it means smaller code and fewer memory
>> accesses.
>
> It means you execute *more* memory accesses.  Always.  This may be
> sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> I'm saying this particular strategy is very far from ideal.

No, it means fewer, since the number of memory accesses is reduced
(memory bandwidth may increase, but that's not an issue).  You can
argue that different pairings might work better, as you may be able to
push some LDP/STPs into less frequently executed blocks, but finding
optimal pairings is a hard problem since the offsets need to be
consecutive.

>> That may well be the problem.  So if there are N predecessors, of
>> which N-1 need to restore the same set of callee-saves, but one was
>> shrinkwrapped, N-1 copies of the same restores might be emitted.  N
>> could be the number of blocks in a function - I really hope it
>> doesn't work out like that...
>
> In the worst case it would.  OTOH, joining every combo into blocks
> costs O(2**C) (where C is the # components) bb's worst case.
>
> It isn't a simple problem.  The current tuning works pretty well for
> us, but no doubt it can be improved!

Well, if there are C components, we could limit the total number of
saves/restores inserted to, say, 4C.  Similarly, common cases could
easily share the restores without increasing the number of branches.

Wilco
On Tue, Jan 09, 2018 at 12:23:42PM +0000, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> > It means you execute *more* memory accesses.  Always.  This may be
> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> > I'm saying this particular strategy is very far from ideal.
>
> No, it means fewer, since the number of memory accesses is reduced
> (memory bandwidth may increase, but that's not an issue).

The problem is that *more* memory accesses are executed at runtime.
Which is why separate shrink-wrapping does what it does: to have
*fewer* executed.  (It's not just the direct execution cost why that
helps: more important are latencies to dependent ops,
microarchitectural traps, etc.)

If you make A always stored whenever B is, and the other way around,
the optimal place to do it will always store at least as often as
either A or B, _but can also store more often than either_.

> Well, if there are C components, we could limit the total number of
> saves/restores inserted to, say, 4C.  Similarly, common cases could
> easily share the restores without increasing the number of branches.

It is common to see many saves/restores generated for the exceptional
cases.


Segher
On Tue, Jan 9, 2018 at 6:54 AM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> On Tue, Jan 09, 2018 at 12:23:42PM +0000, Wilco Dijkstra wrote:
>> No, it means fewer, since the number of memory accesses is reduced
>> (memory bandwidth may increase, but that's not an issue).
>
> The problem is that *more* memory accesses are executed at runtime.
> Which is why separate shrink-wrapping does what it does: to have
> *fewer* executed.  (It's not just the direct execution cost why that
> helps: more important are latencies to dependent ops,
> microarchitectural traps, etc.)

On most AArch64 micro-architectures, one LDP/STP takes just as long as
one LDR/STR, as long as it stays within the same cache line.  So one
LDP/STP is much better than two LDR/STRs.  An LDP/STP is really
considered one memory access, and that is where the confusion is
coming from.  We are reducing the overall number of memory accesses,
or keeping it the same, on that path.  I hope this explanation shows
why pairing does not degrade the code quality but improves it overall.

Thanks,
Andrew

> If you make A always stored whenever B is, and the other way around,
> the optimal place to do it will always store at least as often as
> either A or B, _but can also store more often than either_.
>
>> That may well be the problem.  So if there are N predecessors, of
>> which N-1 need to restore the same set of callee-saves, but one was
>> shrinkwrapped, N-1 copies of the same restores might be emitted.  N
>> could be the number of blocks in a function - I really hope it
>> doesn't work out like that...
>
> In the worst case it would.  OTOH, joining every combo into blocks
> costs O(2**C) (where C is the # components) bb's worst case.
>
> It isn't a simple problem.  The current tuning works pretty well for
> us, but no doubt it can be improved!
>
> It is common to see many saves/restores generated for the exceptional
> cases.
>
>
> Segher
On Tue, Jan 09, 2018 at 09:13:23PM -0800, Andrew Pinski wrote:
> On most AArch64 micro-architectures, one LDP/STP takes just as long
> as one LDR/STR, as long as it stays within the same cache line.  So
> one LDP/STP is much better than two LDR/STRs.  An LDP/STP is really
> considered one memory access, and that is where the confusion is
> coming from.  We are reducing the overall number of memory accesses,
> or keeping it the same, on that path.  I hope this explanation shows
> why pairing does not degrade the code quality but improves it
> overall.

Of course I see that ldp is useful.  I don't think that this particular
way of forcing more pairs is a good idea.  It needs testing /
benchmarking / instrumentation, and we haven't seen any of that.

Forcing pairs before separate shrink-wrapping reduces the effectiveness
of the latter by a lot.


Segher
Segher Boessenkool wrote:

> Of course I see that ldp is useful.  I don't think that this
> particular way of forcing more pairs is a good idea.  It needs
> testing / benchmarking / instrumentation, and we haven't seen any of
> that.

I wouldn't propose a patch if it caused slowdowns.  In fact I am seeing
speedups, particularly in benchmarks which benefit from shrinkwrapping
(eg. povray).  Int is flat, and there is an overall gain on fp.

> Forcing pairs before separate shrink-wrapping reduces the
> effectiveness of the latter by a lot.

That doesn't appear to be the case - or at least any reduction in
effectiveness is more than mitigated by the lower number of memory
accesses and I-cache misses.  To do better still you'd need to compute
an optimal set of pairs, and that is quite difficult in the current
infrastructure.  I could dynamically create pairs just in the backend,
but that won't be optimal either.

Wilco
On Thu, Jan 11, 2018 at 03:35:37PM +0000, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
>
> > Of course I see that ldp is useful.  I don't think that this
> > particular way of forcing more pairs is a good idea.  It needs
> > testing / benchmarking / instrumentation, and we haven't seen any
> > of that.
>
> I wouldn't propose a patch if it caused slowdowns.  In fact I am
> seeing speedups, particularly in benchmarks which benefit from
> shrinkwrapping (eg. povray).  Int is flat, and there is an overall
> gain on fp.
>
> > Forcing pairs before separate shrink-wrapping reduces the
> > effectiveness of the latter by a lot.
>
> That doesn't appear to be the case - or at least any reduction in
> effectiveness is more than mitigated by the lower number of memory
> accesses and I-cache misses.  To do better still you'd need to
> compute an optimal set of pairs, and that is quite difficult in the
> current infrastructure.  I could dynamically create pairs just in the
> backend, but that won't be optimal either.

Right, I certainly believe forming more pairs before sws (as you do)
improves the code -- but I think forming the pairs only after sws will
work even better.

But yes, that is more work to implement, and the benefit (if any) is
unknown.  I hoped I could convince you to try ;-)


Segher
Segher Boessenkool <segher@kernel.crashing.org> writes:
> Right, I certainly believe forming more pairs before sws (as you do)
> improves the code -- but I think forming the pairs only after sws
> will work even better.
>
> But yes, that is more work to implement, and the benefit (if any) is
> unknown.  I hoped I could convince you to try ;-)

Where do we stand with this patch?  I think it's legitimately a
regression fix; like Wilco, I'm seeing benchmarks in which epilogue
restores are no longer paired for important functions.  Even if things
could be improved elsewhere, the patch seems to make sense on its own
and is also nicely localised.

Thanks,
Richard
On Fri, Jan 05, 2018 at 12:22:44PM +0000, Wilco Dijkstra wrote:
> The shrinkwrap optimization added late in GCC 7 allows each
> callee-save to be delayed and done only across blocks which need a
> particular callee-save.  Although this reduces unnecessary memory
> traffic on code paths that need few callee-saves, it typically uses
> LDR/STR rather than LDP/STP.  The number of LDP/STP instructions is
> reduced by ~7%.  This means more memory accesses and increased
> codesize, ~1.0% on average.
>
> To improve this, if a particular callee-save must be saved/restored,
> also add the adjacent callee-save to allow use of LDP/STP.  This
> significantly reduces codesize (for example gcc_r, povray_r,
> parest_r, xalancbmk_r are 1% smaller).  This is a simple fix which
> can be backported.  A more advanced approach would scan blocks for
> pairs of callee-saves, but that requires a rewrite of all the
> callee-save code which is too late at this stage.
>
> An example epilog in a shrinkwrapped function before:
>
>         ldp     x21, x22, [sp,#16]
>         ldr     x23, [sp,#32]
>         ldr     x24, [sp,#40]
>         ldp     x25, x26, [sp,#48]
>         ldr     x27, [sp,#64]
>         ldr     x28, [sp,#72]
>         ldr     x30, [sp,#80]
>         ldr     d8, [sp,#88]
>         ldp     x19, x20, [sp],#96
>         ret
>
> And after this patch:
>
>         ldr     d8, [sp,#88]
>         ldp     x21, x22, [sp,#16]
>         ldp     x23, x24, [sp,#32]
>         ldp     x25, x26, [sp,#48]
>         ldp     x27, x28, [sp,#64]
>         ldr     x30, [sp,#80]
>         ldp     x19, x20, [sp],#96
>         ret
>
> Passes bootstrap, OK for commit (and backport to GCC7)?

OK.

Thanks,
James

> ChangeLog:
> 2018-01-05  Wilco Dijkstra  <wdijkstr@arm.com>
>
>	* config/aarch64/aarch64.c (aarch64_components_for_bb):
>	Increase LDP/STP opportunities by adding adjacent callee-saves.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9735fc18402dd8fe2fa4022eef4c0522814a0552..da21032b19413d0361b8d30b51a31124eaaa31a1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3503,7 +3503,22 @@ aarch64_components_for_bb (basic_block bb)
 	&& (bitmap_bit_p (in, regno)
 	    || bitmap_bit_p (gen, regno)
 	    || bitmap_bit_p (kill, regno)))
-      bitmap_set_bit (components, regno);
+      {
+	unsigned regno2, offset, offset2;
+	bitmap_set_bit (components, regno);
+
+	/* If there is a callee-save at an adjacent offset, add it too
+	   to increase the use of LDP/STP.  */
+	offset = cfun->machine->frame.reg_offset[regno];
+	regno2 = ((offset & 8) == 0) ? regno + 1 : regno - 1;
+
+	if (regno2 <= LAST_SAVED_REGNUM)
+	  {
+	    offset2 = cfun->machine->frame.reg_offset[regno2];
+	    if ((offset & ~8) == (offset2 & ~8))
+	      bitmap_set_bit (components, regno2);
+	  }
+      }

   return components;
 }