Message ID | HE1PR07MB09054A3F1F796882634605CEE4C50@HE1PR07MB0905.eurprd07.prod.outlook.com |
---|---|
State | New |
Headers | show |
Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > this patch skips the redundant re-computing of the frame info after reload completed. > > I looked at all available targets initial_elimination_offset functions: > > All of them currently use either a trivial function of crtl->outgoing_args_size, > get_frame_size () > and df_regs_ever_live_p (x), that can be expected to be evaluated quickly and to be > constant, > or they use a cached value, if they are called when reload_completed is true. > Hi Bernd, For the sake of consistency with other targets then this looks fine. I'd both prefer the MIPS backend to be as efficient as other targets and not crash when something like initial elimination offset gets used post reload. Whether it is right or wrong that we end up with these calls post-reload is a different matter but I don't understand the detail enough to comment. > I believe this patch will both be a performance optimization and > guarantee that the frame info can not unexpectedly change when it > should not. I am a bit dubious about this comment as simply not re-computing the frame info is not quite the same as it being consistent with register usage etc. However, we have no check for this right now post-reload so there is no requirement to improve the situation. > I have successfully built a cross-compiler with this patch. > Is it OK for trunk? Has the patch been tested beyond just building GCC? I can do a test run for you if you don't have things set up to do one yourself. Thanks, Matthew
Matthew Fortune <Matthew.Fortune@imgtec.com> writes: > Has the patch been tested beyond just building GCC? I can do a > test run for you if you don't have things set up to do one yourself. I built a cross-gcc with all languages and a cross-glibc, but I have not set up an emulation environment, so if you could give it a test that would be highly welcome. Thanks Bernd.
Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > Matthew Fortune <Matthew.Fortune@imgtec.com> writes: > > Has the patch been tested beyond just building GCC? I can do a > > test run for you if you don't have things set up to do one yourself. > > I built a cross-gcc with all languages and a cross-glibc, but I have > not set up an emulation environment, so if you could give it a test > that would be highly welcome. mipsel-linux-gnu test results are the same before and after this patch. Please go ahead and commit. Thanks, Matthew
[cc-ing Eric as RTL maintainer] Matthew Fortune <Matthew.Fortune@imgtec.com> writes: > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >> Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >> > Has the patch been tested beyond just building GCC? I can do a >> > test run for you if you don't have things set up to do one yourself. >> >> I built a cross-gcc with all languages and a cross-glibc, but I have >> not set up an emulation environment, so if you could give it a test >> that would be highly welcome. > > mipsel-linux-gnu test results are the same before and after this patch. > > Please go ahead and commit. I still object to this. And it feels like the patch was posted as though it was a new one in order to avoid answering the objections that were raised when it was last posted: https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02218.html IMO the problem is that rtx_addr_can_trap_p_1 duplicates a large bit of LRA/reload logic: ------------------------------------------------------------------------ /* Compute an approximation for the offset between the register FROM and TO for the current function, as it was at the start of the routine. */ static HOST_WIDE_INT get_initial_register_offset (int from, int to) { #ifdef ELIMINABLE_REGS static const struct elim_table_t { const int from; const int to; } table[] = ELIMINABLE_REGS; HOST_WIDE_INT offset1, offset2; unsigned int i, j; if (to == from) return 0; /* It is not safe to call INITIAL_ELIMINATION_OFFSET before the reload pass. We need to give at least an estimation for the resulting frame size. */ if (! reload_completed) { offset1 = crtl->outgoing_args_size + get_frame_size (); #if !STACK_GROWS_DOWNWARD offset1 = - offset1; #endif if (to == STACK_POINTER_REGNUM) return offset1; else if (from == STACK_POINTER_REGNUM) return - offset1; else return 0; } for (i = 0; i < ARRAY_SIZE (table); i++) if (table[i].from == from) { if (table[i].to == to) { INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, offset1); return offset1; } for (j = 0; j < ARRAY_SIZE (table); j++) { if (table[j].to == to && table[j].from == table[i].to) { INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, offset1); INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, offset2); return offset1 + offset2; } if (table[j].from == to && table[j].to == table[i].to) { INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, offset1); INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, offset2); return offset1 - offset2; } } } else if (table[i].to == from) { if (table[i].from == to) { INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, offset1); return - offset1; } for (j = 0; j < ARRAY_SIZE (table); j++) { if (table[j].to == to && table[j].from == table[i].from) { INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, offset1); INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, offset2); return - offset1 + offset2; } if (table[j].from == to && table[j].to == table[i].from) { INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, offset1); INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, offset2); return - offset1 - offset2; } } } /* If the requested register combination was not found, try a different more simple combination. */ if (from == ARG_POINTER_REGNUM) return get_initial_register_offset (HARD_FRAME_POINTER_REGNUM, to); else if (to == ARG_POINTER_REGNUM) return get_initial_register_offset (from, HARD_FRAME_POINTER_REGNUM); else if (from == HARD_FRAME_POINTER_REGNUM) return get_initial_register_offset (FRAME_POINTER_REGNUM, to); else if (to == HARD_FRAME_POINTER_REGNUM) return get_initial_register_offset (from, FRAME_POINTER_REGNUM); else return 0; #else HOST_WIDE_INT offset; if (to == from) return 0; if (reload_completed) { INITIAL_FRAME_POINTER_OFFSET (offset); } else { offset = crtl->outgoing_args_size + get_frame_size (); #if !STACK_GROWS_DOWNWARD offset = - offset; #endif } if (to == STACK_POINTER_REGNUM) return offset; else if (from == STACK_POINTER_REGNUM) return - offset; else return 0; #endif } ------------------------------------------------------------------------ Under the current interface macros like INITIAL_ELIMINATION_OFFSET are expected to trigger the calculation of the target's frame layout (since you need that information to answer the question). To me it seems wrong that we're attempting to call that sort of macro in a query routine like rtx_addr_can_trap_p_1. The frame layout is fixed for each LRA/reload cycle. The layout for the last of those cycles carries forward through the rest of compilation. IMO we should cache the information we need at the start of each LRA/reload cycle. This will be more robust, because there will be no accidental changes to global state either during or after LRA/reload. It will also be more efficient because rtx_addr_can_trap_p_1 can read the cached variables rather than calling back into the target. Thanks, Richard
On 26.01.2016 22:18, Richard Sandiford wrote: > [cc-ing Eric as RTL maintainer] > > Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>> Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >>>> Has the patch been tested beyond just building GCC? I can do a >>>> test run for you if you don't have things set up to do one yourself. >>> >>> I built a cross-gcc with all languages and a cross-glibc, but I have >>> not set up an emulation environment, so if you could give it a test >>> that would be highly welcome. >> >> mipsel-linux-gnu test results are the same before and after this patch. >> >> Please go ahead and commit. > > I still object to this. And it feels like the patch was posted > as though it was a new one in order to avoid answering the objections > that were raised when it was last posted: > > https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02218.html > Richard, I am really sorry when you feel now like I did not take your objections seriously. Let me first explain what happened from my point of view: When I posted this response to your objections here: https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00235.html I was waiting for your response, but nothing happened, so I kind of forgot about the issue. In the meantime Ubuntu and Debian began to roll out GCC-6 and they got stuck at the same issue, but instead of waiting for pr69012 to be eventually resolved they created a duplicate pr69129, and then last Thursday Nick applied my initial patch without my intervention, probably because of the pressure they put on us. That changed significantly how I looked at the issue after that point, as there was no actual regression anymore, the revised patch still made sense, but for another reason. When you look at the 20+ targets in the gcc tree you'll see that almost all of them have a frame-layout computation function and all except mips have a shortcut "if (reload_completed) return;" in that function. And OTOH mips has one of the most complicated frame layout functions of all targets. For all of these reasons I posted a new patch which tries to resolve differences between mips and other targets inital_elimination_offset functions. I still think that it is not OK for the mips target to do the frame layout already in mips_frame_pointer_required because the frame layout will change several times, until reload is completed, and that function is only called right in the beginning. And I think that it is not really well-designed to have a frame layout function F(x,y)->z unless you assume F to be a trivial O(1) function that has no significant computation overhead. When you assume F to be an expensive O(N) or higher complexity you would design the interface like compute_frame_layout_now() and get_cached_initial_elimination_offset(x,y)->z. Also note, that with the current design, of initial_elimination_offset the frame layout is already computed several times, because reload has to call the function with different parameters, and there is no way how to know when the cached value can be used and when not. I do also think that having to cache the initial elimination offset values in the back-end that are already cached in the target does not improve anything. Then I would prefer to have a set of new target callbacks that makes it easy to access the cached target values. If you want to propose a patch for that I will not object, but I doubt it will be suitable for Stage 4. Thanks Bernd. > IMO the problem is that rtx_addr_can_trap_p_1 duplicates a large > bit of LRA/reload logic: > > ------------------------------------------------------------------------ > /* Compute an approximation for the offset between the register > FROM and TO for the current function, as it was at the start > of the routine. */ > > static HOST_WIDE_INT > get_initial_register_offset (int from, int to) > { > #ifdef ELIMINABLE_REGS > static const struct elim_table_t > { > const int from; > const int to; > } table[] = ELIMINABLE_REGS; > HOST_WIDE_INT offset1, offset2; > unsigned int i, j; > > if (to == from) > return 0; > > /* It is not safe to call INITIAL_ELIMINATION_OFFSET > before the reload pass. We need to give at least > an estimation for the resulting frame size. */ > if (! reload_completed) > { > offset1 = crtl->outgoing_args_size + get_frame_size (); > #if !STACK_GROWS_DOWNWARD > offset1 = - offset1; > #endif > if (to == STACK_POINTER_REGNUM) > return offset1; > else if (from == STACK_POINTER_REGNUM) > return - offset1; > else > return 0; > } > > for (i = 0; i < ARRAY_SIZE (table); i++) > if (table[i].from == from) > { > if (table[i].to == to) > { > INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, > offset1); > return offset1; > } > for (j = 0; j < ARRAY_SIZE (table); j++) > { > if (table[j].to == to > && table[j].from == table[i].to) > { > INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, > offset1); > INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, > offset2); > return offset1 + offset2; > } > if (table[j].from == to > && table[j].to == table[i].to) > { > INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, > offset1); > INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, > offset2); > return offset1 - offset2; > } > } > } > else if (table[i].to == from) > { > if (table[i].from == to) > { > INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, > offset1); > return - offset1; > } > for (j = 0; j < ARRAY_SIZE (table); j++) > { > if (table[j].to == to > && table[j].from == table[i].from) > { > INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, > offset1); > INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, > offset2); > return - offset1 + offset2; > } > if (table[j].from == to > && table[j].to == table[i].from) > { > INITIAL_ELIMINATION_OFFSET (table[i].from, table[i].to, > offset1); > INITIAL_ELIMINATION_OFFSET (table[j].from, table[j].to, > offset2); > return - offset1 - offset2; > } > } > } > > /* If the requested register combination was not found, > try a different more simple combination. */ > if (from == ARG_POINTER_REGNUM) > return get_initial_register_offset (HARD_FRAME_POINTER_REGNUM, to); > else if (to == ARG_POINTER_REGNUM) > return get_initial_register_offset (from, HARD_FRAME_POINTER_REGNUM); > else if (from == HARD_FRAME_POINTER_REGNUM) > return get_initial_register_offset (FRAME_POINTER_REGNUM, to); > else if (to == HARD_FRAME_POINTER_REGNUM) > return get_initial_register_offset (from, FRAME_POINTER_REGNUM); > else > return 0; > > #else > HOST_WIDE_INT offset; > > if (to == from) > return 0; > > if (reload_completed) > { > INITIAL_FRAME_POINTER_OFFSET (offset); > } > else > { > offset = crtl->outgoing_args_size + get_frame_size (); > #if !STACK_GROWS_DOWNWARD > offset = - offset; > #endif > } > > if (to == STACK_POINTER_REGNUM) > return offset; > else if (from == STACK_POINTER_REGNUM) > return - offset; > else > return 0; > > #endif > } > ------------------------------------------------------------------------ > > Under the current interface macros like INITIAL_ELIMINATION_OFFSET > are expected to trigger the calculation of the target's frame layout > (since you need that information to answer the question). > To me it seems wrong that we're attempting to call that sort of > macro in a query routine like rtx_addr_can_trap_p_1. > > The frame layout is fixed for each LRA/reload cycle. The layout > for the last of those cycles carries forward through the rest > of compilation. > > IMO we should cache the information we need at the start of each > LRA/reload cycle. This will be more robust, because there will > be no accidental changes to global state either during or after > LRA/reload. It will also be more efficient because > rtx_addr_can_trap_p_1 can read the cached variables rather > than calling back into the target. > > Thanks, > Richard >
> [cc-ing Eric as RTL maintainer] Sorry for the delay, the message apparently bounced] > IMO the problem is that rtx_addr_can_trap_p_1 duplicates a large > bit of LRA/reload logic: > > [...] > > Under the current interface macros like INITIAL_ELIMINATION_OFFSET > are expected to trigger the calculation of the target's frame layout > (since you need that information to answer the question). > To me it seems wrong that we're attempting to call that sort of > macro in a query routine like rtx_addr_can_trap_p_1. I'm a little uncomfortable stepping in here because, while I essentially share your objections and was opposed to the patch (I was almost sure that it would introduce maintainance issues for no practical benefit), I didn't imagine that such a failure mode would have been possible (computing an approximation of the frame layout after reload being problematic) so I didn't really object to being overruled after seeing Bernd's patch... > IMO we should cache the information we need@the start of each > LRA/reload cycle. This will be more robust, because there will > be no accidental changes to global state either during or after > LRA/reload. It will also be more efficient because > rtx_addr_can_trap_p_1 can read the cached variables rather > than calling back into the target. That would be a better design for sure and would eliminate the maintainance issues I was originally afraid of. My recommendation would be to back out Bernd's patch for GCC 6.0 (again, it doesn't fix any regression and, more importantly, any bug in real software, but only corner case bugs in dumb computer-generated testcases) but with the commitment to address the underlying issue for GCC 7.0 and backport the fix to GCC 6.x unless really impracticable. That being said, it's ultimately Jakub and Richard's call.
Bernd Edlinger <bernd.edlinger@hotmail.de> writes: > On 26.01.2016 22:18, Richard Sandiford wrote: >> [cc-ing Eric as RTL maintainer] >> >> Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>>> Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >>>>> Has the patch been tested beyond just building GCC? I can do a >>>>> test run for you if you don't have things set up to do one yourself. >>>> >>>> I built a cross-gcc with all languages and a cross-glibc, but I have >>>> not set up an emulation environment, so if you could give it a test >>>> that would be highly welcome. >>> >>> mipsel-linux-gnu test results are the same before and after this patch. >>> >>> Please go ahead and commit. >> >> I still object to this. And it feels like the patch was posted >> as though it was a new one in order to avoid answering the objections >> that were raised when it was last posted: >> >> https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02218.html >> > > Richard, I am really sorry when you feel now like I did not take your > objections seriously. Let me first explain what happened from my point > of view: > > When I posted this response to your objections here: > > https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00235.html > > I was waiting for your response, but nothing happened, so I kind of > forgot about the issue. In the meantime Ubuntu and Debian began to > roll out GCC-6 and they got stuck at the same issue, but instead of > waiting for pr69012 to be eventually resolved they created a duplicate > pr69129, and then last Thursday Nick applied my initial patch without > my intervention, probably because of the pressure they put on us. Ah, I'd missed that, sorry. It's not obvious from the email thread that the patch was actually approved. I just see a message from Matthias saying that it worked for him and then a message from Nick saying that he'd applied it. Ah well. I guess at some point I have to get over the fact that I'm no longer the MIPS maintainer :-) > That changed significantly how I looked at the issue after that point, > as there was no actual regression anymore, the revised patch still made > sense, but for another reason. When you look at the 20+ targets in the > gcc tree you'll see that almost all of them have a frame-layout > computation function and all except mips have a shortcut > "if (reload_completed) return;" in that function. And OTOH mips has > one of the most complicated frame layout functions of all targets. > > For all of these reasons I posted a new patch which tries to resolve > differences between mips and other targets inital_elimination_offset > functions. OK. But the point still stands that the patch is only useful because we're now calling mips_compute_frame_info in cases where we wouldn't previously, because of the rtx_addr_can_trap_p changes. > I still think that it is not OK for the mips target to do the frame > layout already in mips_frame_pointer_required because the frame layout > will change several times, until reload is completed, and that function > is only called right in the beginning. I don't think it's any better or worse than doing the frame layout in INITIAL_ELIMINATION_OFFSET (which is common practice and pretty much required). They're both part of the initial setup phase -- targetm.frame_layout_required determines frame_pointer_needed, which is a vital input to the code that decides which eliminations to make. And this is an example of us doing the kind of caching that I was suggesting. Code that wants to know whether the frame pointer is needed for the current function should use frame_pointer_needed. Only the code that sets up frame_pointer_needed should call frame_pointer_required directly. > And I think that it is not really well-designed to have a frame layout > function F(x,y)->z unless you assume F to be a trivial O(1) function > that has no significant computation overhead. When you assume F to be > an expensive O(N) or higher complexity you would design the interface > like compute_frame_layout_now() and > get_cached_initial_elimination_offset(x,y)->z. > > Also note, that with the current design, of initial_elimination_offset > the frame layout is already computed several times, because reload has > to call the function with different parameters, and there is no way how > to know when the cached value can be used and when not. Agreed -- the current interface is pretty poor. But that's even more reason not to add uses of it after LRA/reload. Caching would be both simpler and more efficient. > I do also think that having to cache the initial elimination offset > values in the back-end that are already cached in the target does not > improve anything. Then I would prefer to have a set of new target > callbacks that makes it easy to access the cached target values. I don't agree. Asking each backend to cache a particular value in its frame_info structure and then providing a hook to query that cached value means adding code to every target. And calling an indirect function is going to much less efficient than accessing a structure field. It seems better to have one piece of code (or maybe two, until reload goes away) that caches the information that the target-independent code needs. I think that applies to other frame-related information too. At the moment there's no simple way for the postreload passes to tell which registers are available for use and which have to be left untouched, so we get complicated tests like: for (i = nregs - 1; i >= 0; --i) if (TEST_HARD_REG_BIT (this_unavailable, new_reg + i) || fixed_regs[new_reg + i] || global_regs[new_reg + i] /* Can't use regs which aren't saved by the prologue. */ || (! df_regs_ever_live_p (new_reg + i) && ! call_used_regs[new_reg + i]) #ifdef LEAF_REGISTERS /* We can't use a non-leaf register if we're in a leaf function. */ || (crtl->is_leaf && !LEAF_REGISTERS[new_reg + i]) #endif || ! HARD_REGNO_RENAME_OK (reg + i, new_reg + i)) return false; This is an example of relying (to some extent) on querying the target code every time we want to know something, but surely it would be better to have LRA/reload create a single "usable registers" HARD_REG_SET. (I have an old patch series for that but never found time to clean it up and submit it.) > If you want to propose a patch for that I will not object, but I doubt > it will be suitable for Stage 4. Fair enough. Guess I just have to learn to live with it. Thanks, Richard
On 01/28/2016 12:36 AM, Eric Botcazou wrote: >> [cc-ing Eric as RTL maintainer] > > Sorry for the delay, the message apparently bounced] > >> IMO the problem is that rtx_addr_can_trap_p_1 duplicates a large >> bit of LRA/reload logic: >> >> [...] >> >> Under the current interface macros like INITIAL_ELIMINATION_OFFSET >> are expected to trigger the calculation of the target's frame layout >> (since you need that information to answer the question). >> To me it seems wrong that we're attempting to call that sort of >> macro in a query routine like rtx_addr_can_trap_p_1. > > I'm a little uncomfortable stepping in here because, while I essentially share > your objections and was opposed to the patch (I was almost sure that it would > introduce maintainance issues for no practical benefit), I didn't imagine that > such a failure mode would have been possible (computing an approximation of > the frame layout after reload being problematic) so I didn't really object to > being overruled after seeing Bernd's patch... > >> IMO we should cache the information we need@the start of each >> LRA/reload cycle. This will be more robust, because there will >> be no accidental changes to global state either during or after >> LRA/reload. It will also be more efficient because >> rtx_addr_can_trap_p_1 can read the cached variables rather >> than calling back into the target. > > That would be a better design for sure and would eliminate the maintainance > issues I was originally afraid of. > > My recommendation would be to back out Bernd's patch for GCC 6.0 (again, it > doesn't fix any regression and, more importantly, any bug in real software, > but only corner case bugs in dumb computer-generated testcases) but with the > commitment to address the underlying issue for GCC 7.0 and backport the fix to > GCC 6.x unless really impracticable. That being said, it's ultimately Jakub > and Richard's call. I'm on the fence; I do think the original problem is an issue we should fix, but I'm also not terribly happy with the implementation we have right now. Besides the issues already mentioned, doesn't it kind of assume these offsets are constant (which they definitely aren't, consider arg pushes for example)? I think a better approach might be to just mark accesses at known locations in the frame, or arg pushes, as MEM_NOTRAP_P, and consider accesses with non-constant or calculated offsets as potentially trapping. Bernd
On 29.01.2016 02:09, Bernd Schmidt wrote: > On 01/28/2016 12:36 AM, Eric Botcazou wrote: >>> [cc-ing Eric as RTL maintainer] >> >> Sorry for the delay, the message apparently bounced] >> >>> IMO the problem is that rtx_addr_can_trap_p_1 duplicates a large >>> bit of LRA/reload logic: >>> >>> [...] >>> >>> Under the current interface macros like INITIAL_ELIMINATION_OFFSET >>> are expected to trigger the calculation of the target's frame layout >>> (since you need that information to answer the question). >>> To me it seems wrong that we're attempting to call that sort of >>> macro in a query routine like rtx_addr_can_trap_p_1. >> >> I'm a little uncomfortable stepping in here because, while I >> essentially share >> your objections and was opposed to the patch (I was almost sure that >> it would >> introduce maintainance issues for no practical benefit), I didn't >> imagine that >> such a failure mode would have been possible (computing an >> approximation of >> the frame layout after reload being problematic) so I didn't really >> object to >> being overruled after seeing Bernd's patch... >> >>> IMO we should cache the information we need@the start of each >>> LRA/reload cycle. This will be more robust, because there will >>> be no accidental changes to global state either during or after >>> LRA/reload. It will also be more efficient because >>> rtx_addr_can_trap_p_1 can read the cached variables rather >>> than calling back into the target. >> >> That would be a better design for sure and would eliminate the >> maintainance >> issues I was originally afraid of. >> >> My recommendation would be to back out Bernd's patch for GCC 6.0 >> (again, it >> doesn't fix any regression and, more importantly, any bug in real >> software, >> but only corner case bugs in dumb computer-generated testcases) but >> with the >> commitment to address the underlying issue for GCC 7.0 and backport >> the fix to >> GCC 6.x unless really impracticable. That being said, it's ultimately >> Jakub >> and Richard's call. > > I'm on the fence; I do think the original problem is an issue we should > fix, but I'm also not terribly happy with the implementation we have > right now. Besides the issues already mentioned, doesn't it kind of > assume these offsets are constant (which they definitely aren't, > consider arg pushes for example)? > Yes that is right. I saw it as a thing that could possibly happen more often than we know, because it is difficult to spot a wrong code by ordinary tests. Even the reproducer from pr61047 does not crash when it runs in the gcc testsuite, I had to tweak the example first to use a larger offset to compensate the large number of environment values that are passed from the test suite in each test execution. Yes, rtx_addr_can_trap_p_1 does not know how many bytes are pushed on the stack and I saw no easy way how to get to the REG_NOTES for instance, because they are attached to the INSN and rtx_addr_can_trap_p has only access to a MEM rtx. It is no exact science, but the error is on the safe side. Nevertheless, with the data from the target hook the approximation is good enough, to not pessimize any single bit of code that is generated in stage2 vs. stage3, which would not have been the case without some help from the target. > I think a better approach might be to just mark accesses at known > locations in the frame, or arg pushes, as MEM_NOTRAP_P, and consider > accesses with non-constant or calculated offsets as potentially trapping. > Yes I think also that might be a next step. It would probably be good to somehow fixate the result of rtx_addr_can_trap_p immediately before reload when the RTX's are still FRAMEP+x and ARGP+x, and annotate that somehow to the reloaded RTX's, that way it would finally be superfluous to call the target hook at all, because the actual addresses should not change during or after reload. It will probably have to be a spare bit on the RTX that is currently unused, because MEM_NOTRAP_P is already used for something different. It will however not be simple to find a valid piece of C code where the current implementation with all of its limitations generates different code compared to an implementation that has access to the exact offsets. As I said, I tried already, but could not find an example of a missed optimization due to my patch. Thanks Bernd. > > Bernd
On Fri, Jan 29, 2016 at 02:09:25AM +0100, Bernd Schmidt wrote: > I'm on the fence; I do think the original problem is an issue we should fix, > but I'm also not terribly happy with the implementation we have right now. The fact that it has been only reported from generated testcases only means we are lucky nobody encountered it yet in real-world programs. Plus, we need to be thankful to people working on those generators that keep reporting GCC bugs, they significantly improve the compiler. > Besides the issues already mentioned, doesn't it kind of assume these > offsets are constant (which they definitely aren't, consider arg pushes for > example)? Sure, for some registers the offsets aren't constant. In some cases we e.g. have REG_ARGS_SIZE notes, but in other cases don't and don't have anything else that would help us understand the difference between current sp value and one at the end of the prologue or so. > I think a better approach might be to just mark accesses at known locations > in the frame, or arg pushes, as MEM_NOTRAP_P, and consider accesses with > non-constant or calculated offsets as potentially trapping. I don't see how that would work generally. Sure, if there is e.g. a constant offset array access, it could be checked easily, but if there is variable offset array access, that is at some point later on changed into a constant offset access, you'd need to be conservative, unless you can prove it is in range. Or perhaps we could also use some other flag (or turn it into __builtin_trap or __builtin_unreachable or whatever) to mark MEMs that are always invalid if executed. Jakub
On 01/29/2016 04:41 PM, Jakub Jelinek wrote: > On Fri, Jan 29, 2016 at 02:09:25AM +0100, Bernd Schmidt wrote: >> I think a better approach might be to just mark accesses at known locations >> in the frame, or arg pushes, as MEM_NOTRAP_P, and consider accesses with >> non-constant or calculated offsets as potentially trapping. > > I don't see how that would work generally. Sure, if there is e.g. a constant > offset array access, it could be checked easily, but if there is variable offset > array access, that is at some point later on changed into a constant offset > access, you'd need to be conservative, unless you can prove it is in range. Yes. What is the problem with that? If we have (plus sfp const_int) at any point before reload, we can check whether that offset is inside frame_size. If it isn't or if the offset isn't known, it could trap. Bernd
On 28.01.2016 23:17, Richard Sandiford wrote: > Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >> On 26.01.2016 22:18, Richard Sandiford wrote: >>> [cc-ing Eric as RTL maintainer] >>> >>> Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >>>> Bernd Edlinger <bernd.edlinger@hotmail.de> writes: >>>>> Matthew Fortune <Matthew.Fortune@imgtec.com> writes: >>>>>> Has the patch been tested beyond just building GCC? I can do a >>>>>> test run for you if you don't have things set up to do one yourself. >>>>> >>>>> I built a cross-gcc with all languages and a cross-glibc, but I have >>>>> not set up an emulation environment, so if you could give it a test >>>>> that would be highly welcome. >>>> >>>> mipsel-linux-gnu test results are the same before and after this patch. >>>> >>>> Please go ahead and commit. >>> >>> I still object to this. And it feels like the patch was posted >>> as though it was a new one in order to avoid answering the objections >>> that were raised when it was last posted: >>> >>> https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02218.html >>> >> >> Richard, I am really sorry when you feel now like I did not take your >> objections seriously. Let me first explain what happened from my point >> of view: >> >> When I posted this response to your objections here: >> >> https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00235.html >> >> I was waiting for your response, but nothing happened, so I kind of >> forgot about the issue. In the meantime Ubuntu and Debian began to >> roll out GCC-6 and they got stuck at the same issue, but instead of >> waiting for pr69012 to be eventually resolved they created a duplicate >> pr69129, and then last Thursday Nick applied my initial patch without >> my intervention, probably because of the pressure they put on us. > > Ah, I'd missed that, sorry. It's not obvious from the email thread > that the patch was actually approved. I just see a message from Matthias > saying that it worked for him and then a message from Nick saying that > he'd applied it. > > Ah well. I guess at some point I have to get over the fact that I'm no > longer the MIPS maintainer :-) > >> That changed significantly how I looked at the issue after that point, >> as there was no actual regression anymore, the revised patch still made >> sense, but for another reason. When you look at the 20+ targets in the >> gcc tree you'll see that almost all of them have a frame-layout >> computation function and all except mips have a shortcut >> "if (reload_completed) return;" in that function. And OTOH mips has >> one of the most complicated frame layout functions of all targets. >> >> For all of these reasons I posted a new patch which tries to resolve >> differences between mips and other targets inital_elimination_offset >> functions. > > OK. But the point still stands that the patch is only useful because > we're now calling mips_compute_frame_info in cases where we wouldn't > previously, because of the rtx_addr_can_trap_p changes. > Yes of course, but it cannot hurt to have all targets behave identical in such a central point. Even if at some point also the implementation in rtx_addr_can_trap_p will eventually be improved in one way or the other. >> I still think that it is not OK for the mips target to do the frame >> layout already in mips_frame_pointer_required because the frame layout >> will change several times, until reload is completed, and that function >> is only called right in the beginning. > > I don't think it's any better or worse than doing the frame layout in > INITIAL_ELIMINATION_OFFSET (which is common practice and pretty much > required). They're both part of the initial setup phase -- > targetm.frame_layout_required determines frame_pointer_needed, which is > a vital input to the code that decides which eliminations to make. > Hmm... That can also be a difference between LRA and traditional reload. Reload calls targetm.frame_pointer_required in update_eliminables, and can apparently handle 0=>1 and 1=>0 transitions here. While LRA calls frame_pointer_required only once in ira_setup_eliminable_regset and can only handle 0=>1 transitions of frame_pointer_needed in setup_can_eliminate, but not based on targetm.frame_pointer_required but targetm.can_eliminate(FRAME_POINTER_REGNUM, STACK_POINTER_REGNUM). At least it I read the source correctly, I could be wrong of course. > And this is an example of us doing the kind of caching that I was > suggesting. Code that wants to know whether the frame pointer is needed > for the current function should use frame_pointer_needed. Only the code > that sets up frame_pointer_needed should call frame_pointer_required > directly. > But frame_pointer_needed adds at least some value to the raw targetm.frame_pointer_required, while a cached result of initial_elimination_offset would not, just conserve the current interface exactly as it is. >> And I think that it is not really well-designed to have a frame layout >> function F(x,y)->z unless you assume F to be a trivial O(1) function >> that has no significant computation overhead. When you assume F to be >> an expensive O(N) or higher complexity you would design the interface >> like compute_frame_layout_now() and >> get_cached_initial_elimination_offset(x,y)->z. >> >> Also note, that with the current design, of initial_elimination_offset >> the frame layout is already computed several times, because reload has >> to call the function with different parameters, and there is no way how >> to know when the cached value can be used and when not. > > Agreed -- the current interface is pretty poor. But that's even more > reason not to add uses of it after LRA/reload. Caching would be both > simpler and more efficient. > >> I do also think that having to cache the initial elimination offset >> values in the back-end that are already cached in the target does not >> improve anything. Then I would prefer to have a set of new target >> callbacks that makes it easy to access the cached target values. > > I don't agree. Asking each backend to cache a particular value in > its frame_info structure and then providing a hook to query that > cached value means adding code to every target. And calling an > indirect function is going to much less efficient than accessing > a structure field. > > It seems better to have one piece of code (or maybe two, until > reload goes away) that caches the information that the target-independent > code needs. > It is for sure a matter of personal taste, nothing more. And yes, it is do-able, and not even really complicated, but it would not improve anything, just conserve the interface as it is (or as it was before I spoiled it :). I would rather like to improve something at this point. For instance I think it would be good to split the initial_elimination_offset function in an optional compute_initial_frame_layout function, doing only the O(N) computation and leave the actual O(1) querying of the result in the initial_elimination_offset function. If the target does not implement the compute_initial_frame_layout the default just does nothing, and the initial_elimination_offset function does not have to change, but if the target implements the optional compute_initial_frame_layout, it can rely on that function to be called immediately before the initial_elimination_offset, which will then be only an O(1) function. That would at least improve a tiny bit over the current state of the initial_elimination_offset function. Tanks Bernd. > I think that applies to other frame-related information too. At the > moment there's no simple way for the postreload passes to tell which > registers are available for use and which have to be left untouched, > so we get complicated tests like: > > for (i = nregs - 1; i >= 0; --i) > if (TEST_HARD_REG_BIT (this_unavailable, new_reg + i) > || fixed_regs[new_reg + i] > || global_regs[new_reg + i] > /* Can't use regs which aren't saved by the prologue. */ > || (! df_regs_ever_live_p (new_reg + i) > && ! call_used_regs[new_reg + i]) > #ifdef LEAF_REGISTERS > /* We can't use a non-leaf register if we're in a > leaf function. */ > || (crtl->is_leaf > && !LEAF_REGISTERS[new_reg + i]) > #endif > || ! HARD_REGNO_RENAME_OK (reg + i, new_reg + i)) > return false; > > This is an example of relying (to some extent) on querying the target > code every time we want to know something, but surely it would be > better to have LRA/reload create a single "usable registers" HARD_REG_SET. > (I have an old patch series for that but never found time to clean it up > and submit it.) > >> If you want to propose a patch for that I will not object, but I doubt >> it will be suitable for Stage 4. > > Fair enough. Guess I just have to learn to live with it. > > Thanks, > Richard >
On 29.01.2016 16:47 Bernd Schmidt wrote: > On 01/29/2016 04:41 PM, Jakub Jelinek wrote: >> On Fri, Jan 29, 2016 at 02:09:25AM +0100, Bernd Schmidt wrote: > >>> I think a better approach might be to just mark accesses at known >>> locations >>> in the frame, or arg pushes, as MEM_NOTRAP_P, and consider accesses with >>> non-constant or calculated offsets as potentially trapping. >> >> I don't see how that would work generally. Sure, if there is e.g. a >> constant >> offset array access, it could be checked easily, but if there is >> variable offset >> array access, that is at some point later on changed into a constant >> offset >> access, you'd need to be conservative, unless you can prove it is in >> range. > > Yes. What is the problem with that? If we have (plus sfp const_int) at > any point before reload, we can check whether that offset is inside > frame_size. If it isn't or if the offset isn't known, it could trap. > > Usually we have "if (x==1234) { read MEM[FP+x]; }", so wo don't know, and then after reload: "if (x==1234) { read MEM[SP+x+sp_fp_offset]; }" but wait, in the if statement we know, that x==1234, so everything turns in one magic constant, and we have a totally new constant offset from the SP register "if (x==1234) { read MEM[SP+1234+sp_fp_offset]; }". Now if rtx_addr_can_trap_p(MEM[SP+1234+sp_fp_offset]) says it cannot trap we think we do not need the if at all => BANG. Bernd.
On 01/29/2016 08:42 PM, Bernd Edlinger wrote: > On 29.01.2016 16:47 Bernd Schmidt wrote: >> >> Yes. What is the problem with that? If we have (plus sfp const_int) at >> any point before reload, we can check whether that offset is inside >> frame_size. If it isn't or if the offset isn't known, it could trap. >> >> > > Usually we have "if (x==1234) { read MEM[FP+x]; }", so wo don't know, > and then after reload: "if (x==1234) { read MEM[SP+x+sp_fp_offset]; }" > but wait, in the if statement we know, that x==1234, so everything > turns in one magic constant, and we have a totally new constant offset > from the SP register "if (x==1234) { read MEM[SP+1234+sp_fp_offset]; }". > Now if rtx_addr_can_trap_p(MEM[SP+1234+sp_fp_offset]) says it cannot > trap we think we do not need the if at all => BANG. What are you trying to say here? As far as I can tell this isn't a problem with my proposed solution (set MEM_NOTRAP_P for valid SFP+x offsets). Bernd
On Mon, Feb 1, 2016 at 1:18 PM, Bernd Schmidt <bschmidt@redhat.com> wrote: > On 01/29/2016 08:42 PM, Bernd Edlinger wrote: >> >> On 29.01.2016 16:47 Bernd Schmidt wrote: >>> >>> >>> Yes. What is the problem with that? If we have (plus sfp const_int) at >>> any point before reload, we can check whether that offset is inside >>> frame_size. If it isn't or if the offset isn't known, it could trap. >>> >>> >> >> Usually we have "if (x==1234) { read MEM[FP+x]; }", so wo don't know, >> and then after reload: "if (x==1234) { read MEM[SP+x+sp_fp_offset]; }" >> but wait, in the if statement we know, that x==1234, so everything >> turns in one magic constant, and we have a totally new constant offset >> from the SP register "if (x==1234) { read MEM[SP+1234+sp_fp_offset]; }". >> Now if rtx_addr_can_trap_p(MEM[SP+1234+sp_fp_offset]) says it cannot >> trap we think we do not need the if at all => BANG. > > > What are you trying to say here? As far as I can tell this isn't a problem > with my proposed solution (set MEM_NOTRAP_P for valid SFP+x offsets). What prevents motion of those across a stack adjustment (thus a place they are _not_ valid?) Richard. > > Bernd
On 02/01/2016 01:49 PM, Richard Biener wrote: > What prevents motion of those across a stack adjustment (thus a place > they are _not_ valid?) If the address is SP-based, dependencies on the address register. If you're thinking prologue stack adjustments, ports where this could be an issue emit suitable barrier insns. Bernd
On Mon, Feb 1, 2016 at 1:51 PM, Bernd Schmidt <bschmidt@redhat.com> wrote: > On 02/01/2016 01:49 PM, Richard Biener wrote: > >> What prevents motion of those across a stack adjustment (thus a place >> they are _not_ valid?) > > > If the address is SP-based, dependencies on the address register. If you're > thinking prologue stack adjustments, ports where this could be an issue emit > suitable barrier insns. Ok. Just I remember we can't mark non-trapping memory references in general as they might be non-trapping only conditional, aka if (p != NULL) *p = 1; I suppose stack accesses are special enough where issues like that can't pop up? I can think of foo (int i) { char *p = alloca(10); if (i < 10) p[i] = 1; } or stuff like that. But I guess your proposal is to handle the simple non-variable-indexed cases and leave the rest as always conservatively trapping. Richard. > > Bernd >
Hi, On 02/01/2016 13:18, Bernd Schmidt wrote: > On 01/29/2016 08:42 PM, Bernd Edlinger wrote: > > Usually we have "if (x==1234) { read MEM[FP+x]; }", so wo don't know, > > and then after reload: "if (x==1234) { read MEM[SP+x+sp_fp_offset]; }" > > but wait, in the if statement we know, that x==1234, so everything > > turns in one magic constant, and we have a totally new constant offset > > from the SP register "if (x==1234) { read MEM[SP+1234+sp_fp_offset]; }". > > Now if rtx_addr_can_trap_p(MEM[SP+1234+sp_fp_offset]) says it cannot > > trap we think we do not need the if at all => BANG. > > What are you trying to say here? As far as I can tell this isn't a > problem with my proposed solution (set MEM_NOTRAP_P for valid SFP+x > offsets). First, I think there are cases when x is still a pseudo before reload, and after reload, some constant propagation manages to replace x with a constant value. Then it will not help to know what rtx_addr_can_trap_p was before reload. Second, I do also think that it may be helpful to annotate the RTX during reload. MEM_NOTRAP_P is already used, so maybe something new like a spare bit would be necessary. But maybe a single bit will not be enough, what I would need, would be to somehow annotate the stack_pointer_rtx, that it has a known constant offset to the initial stack pointer value. The problem with that is, that stack_pointer_rtx is a shared pointer, and it is therefore not possible to modify it, and attach some context to it. Bernd.
2016-01-23 Bernd Edlinger <bernd.edlinger@hotmail.de> * config/mips/mips.c (mips_compute_frame_info): Skip re-computing the frame info after reload completed. Index: gcc/config/mips/mips.c =================================================================== --- gcc/config/mips/mips.c (revision 231954) +++ gcc/config/mips/mips.c (working copy) @@ -10321,6 +10321,10 @@ mips_compute_frame_info (void) HOST_WIDE_INT offset, size; unsigned int regno, i; + /* Skip re-computing the frame info after reload completed. */ + if (reload_completed) + return; + /* Set this function's interrupt properties. */ if (mips_interrupt_type_p (TREE_TYPE (current_function_decl))) {