Message ID: CAFULd4a6OpcLF6wU24YrEPL0nj2mdo11pHYPj2MTMzp_1T2ucQ@mail.gmail.com
State: New
Series: [middle-end]: Fix PR 88070, ICE in create_pre_exit, at mode-switching.c:438
> The extra instruction is generated as a kludge in expand_function_end
> to prevent instructions that may trap from being scheduled into the
> function epilogue. However, the same blockage is generated under
> exactly the same conditions earlier in expand_function_end. The first
> blockage should prevent unwanted scheduling into the epilogue, so
> there is actually no need for the second one.

But there are instructions emitted after the first blockage, aren't there?

Did you check the history of the code?
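[As an aside, a minimal illustration of the situation the kludge guards
against -- a hypothetical example, not taken from the thread: with
-fnon-call-exceptions an ordinary memory access can throw, so it must not
drift below the point where epilogue unwind information ends.]

    /* Illustrative sketch only.  With -fnon-call-exceptions, the load
       through p may trap, and the blockage is meant to keep the
       scheduler from moving such insns down into the function epilogue,
       for which unwind information is not always emitted.  */
    float
    last_load (float *p)
    {
      return *p;   /* potentially trapping access near the function end */
    }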
On Mon, Nov 19, 2018 at 11:42 PM Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> > The extra instruction is generated as a kludge in expand_function_end
> > to prevent instructions that may trap from being scheduled into the
> > function epilogue. However, the same blockage is generated under
> > exactly the same conditions earlier in expand_function_end. The first
> > blockage should prevent unwanted scheduling into the epilogue, so
> > there is actually no need for the second one.
>
> But there are instructions emitted after the first blockage, aren't there?

Yes, but they don't generate exceptions. At least all memory accesses
have the /c flag.

> Did you check the history of the code?

I did.

The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
Later, the blockage was moved after the return label as a fix for
PR25176 [3] in r107871 [4].

After that, r122626 [5] moved the blockage after the label for the
naked return from the function. Relevant posts from the gcc-patches@
ML are at [6], [7]. However, the posts give no concrete examples of
how the scheduler moves instructions from a different BB around the
blockage insn; they just show that there is a jump around the blockage
when __builtin_return is used. I was under the impression that the
scheduler is unable to move instructions over BB boundaries.

A mystery is the tree-ssa merge [8], which copies the hunk moved in
r122626 [5] back to its original position. From this revision onwards,
we emit two blockages.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14381
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=79265
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25176
[4] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=107871
[5] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=122626
[6] https://gcc.gnu.org/ml/gcc-patches/2007-02/msg01143.html
[7] https://gcc.gnu.org/ml/gcc-patches/2007-02/msg01623.html
[8] https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/function.c?limit_changes=0&r1=125624&r2=125623&pathrev=125624

Uros.
> The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
> Later, the blockage was moved after the return label as a fix for
> PR25176 [3] in r107871 [4].
>
> After that, r122626 [5] moved the blockage after the label for the
> naked return from the function. Relevant posts from the gcc-patches@
> ML are at [6], [7]. However, the posts give no concrete examples of
> how the scheduler moves instructions from a different BB around the
> blockage insn; they just show that there is a jump around the blockage
> when __builtin_return is used. I was under the impression that the
> scheduler is unable to move instructions over BB boundaries.

The scheduler works on extended basic blocks. The [7] post gives a
rather convincing explanation and there is a C++ testcase under PR
rtl-opt/14381.

> A mystery is the tree-ssa merge [8], which copies the hunk moved in
> r122626 [5] back to its original position. From this revision onwards,
> we emit two blockages.

It's the dataflow merge, not the tree-ssa merge. The additional
blockage might be needed for DF.

Given that the current PR is totally artificial, I think that we need
to be quite conservative and only do something on mainline. And even
there I'd be rather conservative and remove the kludge only for
targets that emit unwind information in the epilogue (among which
there is x86, I presume).
On Tue, Nov 20, 2018 at 8:59 AM Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> > The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
> > Later, the blockage was moved after the return label as a fix for
> > PR25176 [3] in r107871 [4].
> >
> > After that, r122626 [5] moved the blockage after the label for the
> > naked return from the function. Relevant posts from the gcc-patches@
> > ML are at [6], [7]. However, the posts give no concrete examples of
> > how the scheduler moves instructions from a different BB around the
> > blockage insn; they just show that there is a jump around the blockage
> > when __builtin_return is used. I was under the impression that the
> > scheduler is unable to move instructions over BB boundaries.
>
> The scheduler works on extended basic blocks. The [7] post gives a
> rather convincing explanation and there is a C++ testcase under PR
> rtl-opt/14381.
>
> > A mystery is the tree-ssa merge [8], which copies the hunk moved in
> > r122626 [5] back to its original position. From this revision onwards,
> > we emit two blockages.
>
> It's the dataflow merge, not the tree-ssa merge. The additional
> blockage might be needed for DF.
>
> Given that the current PR is totally artificial, I think that we need
> to be quite conservative and only do something on mainline. And even
> there I'd be rather conservative and remove the kludge only for
> targets that emit unwind information in the epilogue (among which
> there is x86, I presume).

Hm, I think I'll rather go with a somewhat target-dependent patch:

--cut here--
diff --git a/gcc/mode-switching.c b/gcc/mode-switching.c
index 370a49e90a9c..de75efe2b6c9 100644
--- a/gcc/mode-switching.c
+++ b/gcc/mode-switching.c
@@ -252,7 +252,21 @@ create_pre_exit (int n_entities, int *entity_map, const int *num_modes)
       if (EDGE_COUNT (EXIT_BLOCK_PTR_FOR_FN (cfun)->preds) == 1
           && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
           && GET_CODE (PATTERN (last_insn)) == USE
-          && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG)
+          && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
+
+          /* x86 targets use mode-switching infrastructure to
+             conditionally insert vzeroupper instruction at the exit
+             from the function and there is no need to switch the
+             mode before the return value copy.  The vzeroupper insertion
+             pass runs after reload, so use !reload_completed as a stand-in
+             for x86 to skip the search for return value copy insn.
+
+             N.b.: the code below assumes that return copy insn
+             immediately precedes its corresponding use insn.  This
+             assumption does not hold after reload, since sched1 pass
+             can reschedule return copy insn away from its
+             corresponding use insn.  */
+          && !reload_completed)
         {
           int ret_start = REGNO (ret_reg);
           int nregs = REG_NREGS (ret_reg);
--cut here--

WDYT?

Uros.
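[A rough sketch of the assumption the new comment describes -- hypothetical,
simplified code, not the actual create_pre_exit loop: the pass expects the
copy into the return register to sit directly before its USE, and anything
else in between trips the assert that PR 88070 hits.]

    /* Hypothetical simplification of the search described above; the
       real loop in mode-switching.c handles many more cases.  Starting
       from the final USE of the return register, step backwards and
       expect to find the return value copy with nothing in between.  */
    static rtx_insn *
    find_return_copy (rtx_insn *use_insn, unsigned int ret_regno)
    {
      for (rtx_insn *insn = PREV_INSN (use_insn); insn;
           insn = PREV_INSN (insn))
        {
          if (!NONDEBUG_INSN_P (insn))
            continue;               /* skip notes, labels, debug insns */

          rtx set = single_set (insn);
          if (set
              && REG_P (SET_DEST (set))
              && REGNO (SET_DEST (set)) == ret_regno)
            return insn;            /* found the return value copy */

          /* Anything else between the copy and its use is what the
             assert at mode-switching.c:438 rejects -- e.g. insn 19
             rescheduled by sched1 between the copy and the USE.  */
          gcc_unreachable ();
        }
      return NULL;
    }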
On Tue, Nov 20, 2018 at 8:59 AM Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> > The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
> > Later, the blockage was moved after the return label as a fix for
> > PR25176 [3] in r107871 [4].
> >
> > After that, r122626 [5] moved the blockage after the label for the
> > naked return from the function. Relevant posts from the gcc-patches@
> > ML are at [6], [7]. However, the posts give no concrete examples of
> > how the scheduler moves instructions from a different BB around the
> > blockage insn; they just show that there is a jump around the blockage
> > when __builtin_return is used. I was under the impression that the
> > scheduler is unable to move instructions over BB boundaries.
>
> The scheduler works on extended basic blocks. The [7] post gives a
> rather convincing explanation and there is a C++ testcase under PR
> rtl-opt/14381.

Taking into account that BB edges aren't scheduling barriers, I agree
with [7].

> > A mystery is the tree-ssa merge [8], which copies the hunk moved in
> > r122626 [5] back to its original position. From this revision onwards,
> > we emit two blockages.
>
> It's the dataflow merge, not the tree-ssa merge. The additional
> blockage might be needed for DF.

Ah yes. Thanks for the correction. However, I think that - according
to the reintroduced duplicated comment - the blockage reintroduced by
the DF merge should not be needed. I'll investigate this a bit and try
to bootstrap/regtest a patch that removes the reintroduced blockage.

> Given that the current PR is totally artificial, I think that we need
> to be quite conservative and only do something on mainline. And even
> there I'd be rather conservative and remove the kludge only for
> targets that emit unwind information in the epilogue (among which
> there is x86, I presume).

The testcase actually exposes an -fschedule-insns problem with x86, so
the test is not totally artificial. The idea of removing the kludge
for certain targets is tempting (the scheduler will be given some more
freedom), but I agree that it should not be removed in stage 3.

Thanks,
Uros.
On 11/19/18 12:58 PM, Uros Bizjak wrote:
> Hello!
>
> The assert in create_pre_exit at mode-switching.c expects the return
> copy pair with nothing in between. However, the compiler starts the
> mode switching pass with the following sequence:
>
> (insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
>         (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
>                 (const_int -72 [0xffffffffffffffb8])) [0 S8 A64]))
>      "pr88070.c":8 1157 {*movv2sf_internal}
>      (nil))
> (insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
>         (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
>      (nil))
> (insn 20 16 21 2 (unspec_volatile [
>             (const_int 0 [0])
>         ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
>      (nil))
> (insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
>      (nil))

So I know there's an updated patch.  But I thought it might be worth
mentioning that insn 16 here appears to be a nop-move.  Removing it
might address this instance of the problem, but I doubt it's general
enough to address any larger issues.

You still might want to investigate why it's still in the IL.

Jeff
On 11/20/18 3:24 AM, Uros Bizjak wrote:
> On Tue, Nov 20, 2018 at 8:59 AM Eric Botcazou <ebotcazou@adacore.com> wrote:
>>
>>> The blockage was introduced as a fix for PR14381 [1] in r79265 [2].
>>> Later, the blockage was moved after the return label as a fix for
>>> PR25176 [3] in r107871 [4].
>>>
>>> After that, r122626 [5] moved the blockage after the label for the
>>> naked return from the function. Relevant posts from the gcc-patches@
>>> ML are at [6], [7]. However, the posts give no concrete examples of
>>> how the scheduler moves instructions from a different BB around the
>>> blockage insn; they just show that there is a jump around the blockage
>>> when __builtin_return is used. I was under the impression that the
>>> scheduler is unable to move instructions over BB boundaries.
>>
>> The scheduler works on extended basic blocks. The [7] post gives a
>> rather convincing explanation and there is a C++ testcase under PR
>> rtl-opt/14381.
>>
>>> A mystery is the tree-ssa merge [8], which copies the hunk moved in
>>> r122626 [5] back to its original position. From this revision onwards,
>>> we emit two blockages.
>>
>> It's the dataflow merge, not the tree-ssa merge. The additional
>> blockage might be needed for DF.
>>
>> Given that the current PR is totally artificial, I think that we need
>> to be quite conservative and only do something on mainline. And even
>> there I'd be rather conservative and remove the kludge only for
>> targets that emit unwind information in the epilogue (among which
>> there is x86, I presume).
>
> Hm, I think I'll rather go with a somewhat target-dependent patch:
>
> --cut here--
> diff --git a/gcc/mode-switching.c b/gcc/mode-switching.c
> index 370a49e90a9c..de75efe2b6c9 100644
> --- a/gcc/mode-switching.c
> +++ b/gcc/mode-switching.c
> @@ -252,7 +252,21 @@ create_pre_exit (int n_entities, int *entity_map, const int *num_modes)
>        if (EDGE_COUNT (EXIT_BLOCK_PTR_FOR_FN (cfun)->preds) == 1
>            && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
>            && GET_CODE (PATTERN (last_insn)) == USE
> -          && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG)
> +          && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
> +
> +          /* x86 targets use mode-switching infrastructure to
> +             conditionally insert vzeroupper instruction at the exit
> +             from the function and there is no need to switch the
> +             mode before the return value copy.  The vzeroupper insertion
> +             pass runs after reload, so use !reload_completed as a stand-in
> +             for x86 to skip the search for return value copy insn.

Note that the GCN target may well end up needing a late mode switching
pass -- it's got kind of an inverse problem to solve -- where to place
initializations of the exec register, which is needed when we want to
do scalar ops in a simd unit.

I thought the SH used mode switching as well.  But I can't recall if
it was run before register allocation & reload.

jeff
On Wed, Nov 21, 2018 at 12:50 AM Jeff Law <law@redhat.com> wrote:

> > +          /* x86 targets use mode-switching infrastructure to
> > +             conditionally insert vzeroupper instruction at the exit
> > +             from the function and there is no need to switch the
> > +             mode before the return value copy.  The vzeroupper insertion
> > +             pass runs after reload, so use !reload_completed as a stand-in
> > +             for x86 to skip the search for return value copy insn.
>
> Note that the GCN target may well end up needing a late mode switching
> pass -- it's got kind of an inverse problem to solve -- where to place
> initializations of the exec register, which is needed when we want to
> do scalar ops in a simd unit.
>
> I thought the SH used mode switching as well.  But I can't recall if
> it was run before register allocation & reload.

SH mode switching ran before reload.

Uros.
On Wed, Nov 21, 2018 at 12:46 AM Jeff Law <law@redhat.com> wrote:
>
> On 11/19/18 12:58 PM, Uros Bizjak wrote:
> > Hello!
> >
> > The assert in create_pre_exit at mode-switching.c expects the return
> > copy pair with nothing in between. However, the compiler starts the
> > mode switching pass with the following sequence:
> >
> > (insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
> >         (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
> >                 (const_int -72 [0xffffffffffffffb8])) [0 S8 A64]))
> >      "pr88070.c":8 1157 {*movv2sf_internal}
> >      (nil))
> > (insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
> >         (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
> >      (nil))
> > (insn 20 16 21 2 (unspec_volatile [
> >             (const_int 0 [0])
> >         ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
> >      (nil))
> > (insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
> >      (nil))
>
> So I know there's an updated patch.  But I thought it might be worth
> mentioning that insn 16 here appears to be a nop-move.  Removing it
> might address this instance of the problem, but I doubt it's general
> enough to address any larger issues.
>
> You still might want to investigate why it's still in the IL.

Oh yes, I remember this.

These nop-moves were removed in Vlad's patch [1],[2]:

2013-10-25  Vladimir Makarov  <vmakarov@redhat.com>

	...
	* lra-spills.c (lra_final_code_change): Remove useless move insns.

This regressed the vzeroupper insertion pass [3], as reported in [4].

The functionality was later reverted in [5]:

2013-10-26  Vladimir Makarov  <vmakarov@redhat.com>

	Revert:
	2013-10-25  Vladimir Makarov  <vmakarov@redhat.com>
	* lra-spills.c (lra_final_code_change): Remove useless move insns.

IMO this can now be reintroduced, since the vzeroupper pass works in a
different way. We actually have a couple of tests in place for PR58679
[6].

[1] https://gcc.gnu.org/ml/gcc-patches/2013-10/msg02208.html
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204079
[3] https://gcc.gnu.org/ml/gcc-patches/2013-10/msg02225.html
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58679
[5] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204094
[6] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204109

Uros.
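[For reference, the check for such a useless move is conceptually simple;
a minimal sketch of its assumed shape -- not Vlad's actual lra-spills.c
code -- would be:]

    /* Sketch only: after LRA has assigned hard registers, a move whose
       source and destination resolve to the same hard register (like
       insn 16 above, ax = ax) does nothing and can be deleted.  */
    static bool
    useless_move_p (rtx_insn *insn)
    {
      rtx set = single_set (insn);
      return (set != NULL_RTX
              && REG_P (SET_DEST (set))
              && REG_P (SET_SRC (set))
              && REGNO (SET_DEST (set)) == REGNO (SET_SRC (set)));
    }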
On Wed, Nov 21, 2018 at 10:48 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Nov 21, 2018 at 12:46 AM Jeff Law <law@redhat.com> wrote:
> >
> > On 11/19/18 12:58 PM, Uros Bizjak wrote:
> > > Hello!
> > >
> > > The assert in create_pre_exit at mode-switching.c expects the return
> > > copy pair with nothing in between. However, the compiler starts the
> > > mode switching pass with the following sequence:
> > >
> > > (insn 19 18 16 2 (set (reg:V2SF 21 xmm0)
> > >         (mem/c:V2SF (plus:DI (reg/f:DI 7 sp)
> > >                 (const_int -72 [0xffffffffffffffb8])) [0 S8 A64]))
> > >      "pr88070.c":8 1157 {*movv2sf_internal}
> > >      (nil))
> > > (insn 16 19 20 2 (set (reg:V2SF 0 ax [orig:91 <retval> ] [91])
> > >         (reg:V2SF 0 ax [89])) "pr88070.c":8 1157 {*movv2sf_internal}
> > >      (nil))
> > > (insn 20 16 21 2 (unspec_volatile [
> > >             (const_int 0 [0])
> > >         ] UNSPECV_BLOCKAGE) "pr88070.c":8 710 {blockage}
> > >      (nil))
> > > (insn 21 20 23 2 (use (reg:V2SF 21 xmm0)) "pr88070.c":8 -1
> > >      (nil))
> >
> > So I know there's an updated patch.  But I thought it might be worth
> > mentioning that insn 16 here appears to be a nop-move.  Removing it
> > might address this instance of the problem, but I doubt it's general
> > enough to address any larger issues.
> >
> > You still might want to investigate why it's still in the IL.
>
> Oh yes, I remember this.
>
> These nop-moves were removed in Vlad's patch [1],[2]:
>
> 2013-10-25  Vladimir Makarov  <vmakarov@redhat.com>
>
> 	...
> 	* lra-spills.c (lra_final_code_change): Remove useless move insns.
>
> This regressed the vzeroupper insertion pass [3], as reported in [4].
>
> The functionality was later reverted in [5]:
>
> 2013-10-26  Vladimir Makarov  <vmakarov@redhat.com>
>
> 	Revert:
> 	2013-10-25  Vladimir Makarov  <vmakarov@redhat.com>
> 	* lra-spills.c (lra_final_code_change): Remove useless move insns.
>
> IMO this can now be reintroduced, since the vzeroupper pass works in a
> different way. We actually have a couple of tests in place for PR58679
> [6].

The revert of the revert works OK for the PR58679 tests with the
latest compiler.

Uros.
Index: function.c
===================================================================
--- function.c	(revision 266278)
+++ function.c	(working copy)
@@ -5447,13 +5447,6 @@ expand_function_end (void)
   if (naked_return_label)
     emit_label (naked_return_label);
 
-  /* @@@ This is a kludge.  We want to ensure that instructions that
-     may trap are not moved into the epilogue by scheduling, because
-     we don't always emit unwind information for the epilogue.  */
-  if (cfun->can_throw_non_call_exceptions
-      && targetm_common.except_unwind_info (&global_options) != UI_SJLJ)
-    emit_insn (gen_blockage ());
-
   /* If stack protection is enabled for this function, check the guard.  */
   if (crtl->stack_protect_guard
       && targetm.stack_protect_runtime_enabled_p ())
     stack_protect_epilogue ();
Index: testsuite/gcc.target/i386/pr88070.c
===================================================================
--- testsuite/gcc.target/i386/pr88070.c	(nonexistent)
+++ testsuite/gcc.target/i386/pr88070.c	(working copy)
@@ -0,0 +1,12 @@
+/* PR target/88070 */
+/* { dg-do compile } */
+/* { dg-options "-O -fexpensive-optimizations -fnon-call-exceptions -fschedule-insns -fno-dce -fno-dse -mavx" } */
+
+typedef float vfloat2 __attribute__ ((__vector_size__ (2 * sizeof (float))));
+
+vfloat2
+test1float2 (float c)
+{
+  vfloat2 v = { c, c };
+  return v;
+}
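[For reference, the dg-options in the new test correspond to a command line
along these lines -- assuming an AVX-capable x86-64 compiler -- which ICEs
in create_pre_exit before the fix:

  gcc -O -fexpensive-optimizations -fnon-call-exceptions -fschedule-insns \
      -fno-dce -fno-dse -mavx -S pr88070.c]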