Message ID | CAMZc-bwzYQWvQH7sbDczJg6hWktBqA7Pdcbp-FtAnza1By8TAQ@mail.gmail.com |
---|---|
State | New |
Series | [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735] |
On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
> When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> which leads to incorrect optimization in pass_reload.
> In order to solve this problem, this patch introduces a pre_reload
> splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
> problem in pr.
>
> At the same time, in order to optimize the low 128 bits in
> post_reload CSE, this patch also transforms those CLOBBERS to SETs in
> pass_vzeroupper.
>
> It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
> function, because of vzeroupper's pattern, pro_epilog will save and
> restore those registers, which is obviously redundant. In order to
> eliminate this redundancy, a post_reload splitter is introduced, which
> drops those SETs, until epilogue_completed splitter adds those SETs
> back, it looks to be safe since there's no CSE between post_reload
> split2 and epilogue_completed split3??? Also frame info needs to be
> updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> there's usage other than explicit vzeroupper pattern.
>
> Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> Ok for trunk?

Some time ago a support for CLOBBER_HIGH RTX was added (and later
removed for some reason). Perhaps we could resurrect the patch for the
purpose of ferrying 128bit modes via vzeroupper RTX?

+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explictly used, for LRA purpose, make it clear
+     the instruction kills sse registers. */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+                                                gen_rtvec (1, const1_rtx),
+                                                UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})

Wouldn't this also kill lower 128bit values that are not touched by
vzeroupper? A CLOBBER_HIGH would be more appropriate here.

Uros.

> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Count
>         number of __builtin_ia32_vzeroupper.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
>         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
>         that CSE can optimize lower 128 bits.
>         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
>         New.
>         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
>         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
>         REGNO.
>         (ix86_finalize_stack_frame_flags): Recompute frame layout if
>         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
>         * config/i386/i386.h (struct machine_function): Change type of
>         has_explicit_vzeroupper from BOOL_BITFILED to unsigned int.
>         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
>         splitter which will drop all SETs for explicit vzeroupper
>         patterns.
>         (*avx_vzeroupper_1): Generate SET reg to reg instead of
>         CLOBBER, and add pre-reload splitter after it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.
>
>
> --
> BR,
> Hongtao
On Thu, May 13, 2021 at 11:40 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > Hi: > > When __builtin_ia32_vzeroupper is called explicitly, the corresponding > > vzeroupper pattern does not carry any CLOBBERS or SETs before LRA, > > which leads to incorrect optimization in pass_reload. > > In order to solve this problem, this patch introduces a pre_reload > > splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the > > problem in pr. > > > > At the same time, in order to optimize the low 128 bits in > > post_reload CSE, this patch also transforms those CLOBBERS to SETs in > > pass_vzeroupper. > > > > It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15 > > are callee-saved, so even if there're no other uses of xmm6-xmm15 in the > > function, because of vzeroupper's pattern, pro_epilog will save and > > restore those registers, which is obviously redundant. In order to > > eliminate this redundancy, a post_reload splitter is introduced, which > > drops those SETs, until epilogue_completed splitter adds those SETs > > back, it looks to be safe since there's no CSE between post_reload > > split2 and epilogue_completed split3??? Also frame info needs to be > > updated in pro_epilog, which saves and restores xmm6-xmm15 only if > > there's usage other than explicit vzeroupper pattern. > > > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} > > Ok for trunk? > > Some time ago a support for CLOBBER_HIGH RTX was added (and later > removed for some reason). Perhaps we could resurrect the patch for the > purpose of ferrying 128bit modes via vzeroupper RTX? https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html Uros. > > +(define_split > + [(match_parallel 0 "vzeroupper_pattern" > + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > + "TARGET_AVX && ix86_pre_reload_split ()" > + [(match_dup 0)] > +{ > + /* When vzeroupper is explictly used, for LRA purpose, make it clear > + the instruction kills sse registers. */ > + gcc_assert (cfun->machine->has_explicit_vzeroupper); > + unsigned int nregs = TARGET_64BIT ? 16 : 8; > + rtvec vec = rtvec_alloc (nregs + 1); > + RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode, > + gen_rtvec (1, const1_rtx), > + UNSPECV_VZEROUPPER); > + for (unsigned int i = 0; i < nregs; ++i) > + { > + unsigned int regno = GET_SSE_REGNO (i); > + rtx reg = gen_rtx_REG (V2DImode, regno); > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > + } > + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > +}) > > Wouldn't this also kill lower 128bit values that are not touched by > vzeroupper? A CLOBBER_HIGH would be more appropriate here. > > Uros. > > > > gcc/ChangeLog: > > > > PR target/82735 > > * config/i386/i386-expand.c (ix86_expand_builtin): Count > > number of __builtin_ia32_vzeroupper. > > * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers): > > Transform CLOBBERs to SETs for explicit vzeroupper pattern so > > that CSE can optimize lower 128 bits. > > * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog): > > New. > > (ix86_save_reg): If there's no use of xmm6~xmm15 other than > > explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save > > REGNO. > > (ix86_finalize_stack_frame_flags): Recompute frame layout if > > there's explicit vzeroupper under TARGET_64BIT_MS_ABI. > > * config/i386/i386.h (struct machine_function): Change type of > > has_explicit_vzeroupper from BOOL_BITFILED to unsigned int. 
> > * config/i386/sse.md (*avx_vzeroupper_2): New post-reload > > splitter which will drop all SETs for explicit vzeroupper > > patterns. > > (*avx_vzeroupper_1): Generate SET reg to reg instead of > > CLOBBER, and add pre-reload splitter after it. > > > > gcc/testsuite/ChangeLog: > > > > PR target/82735 > > * gcc.target/i386/pr82735-1.c: New test. > > * gcc.target/i386/pr82735-2.c: New test. > > * gcc.target/i386/pr82735-3.c: New test. > > * gcc.target/i386/pr82735-4.c: New test. > > * gcc.target/i386/pr82735-5.c: New test. > > > > > > -- > > BR, > > Hongtao
On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: > > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} > > > Ok for trunk? > > > > Some time ago a support for CLOBBER_HIGH RTX was added (and later > > removed for some reason). Perhaps we could resurrect the patch for the > > purpose of ferrying 128bit modes via vzeroupper RTX? > > https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html is where it got removed, CCing Richard. > > +(define_split > > + [(match_parallel 0 "vzeroupper_pattern" > > + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > > + "TARGET_AVX && ix86_pre_reload_split ()" > > + [(match_dup 0)] > > +{ > > + /* When vzeroupper is explictly used, for LRA purpose, make it clear > > + the instruction kills sse registers. */ > > + gcc_assert (cfun->machine->has_explicit_vzeroupper); > > + unsigned int nregs = TARGET_64BIT ? 16 : 8; > > + rtvec vec = rtvec_alloc (nregs + 1); > > + RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode, > > + gen_rtvec (1, const1_rtx), > > + UNSPECV_VZEROUPPER); > > + for (unsigned int i = 0; i < nregs; ++i) > > + { > > + unsigned int regno = GET_SSE_REGNO (i); > > + rtx reg = gen_rtx_REG (V2DImode, regno); > > + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); > > + } > > + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); > > +}) > > > > Wouldn't this also kill lower 128bit values that are not touched by > > vzeroupper? A CLOBBER_HIGH would be more appropriate here. Yes, it would. But normally the only xmm* hard regs live across the explicit user vzeroupper would be local and global register variables, I think the 1st scheduler etc. shouldn't extend lifetime of the xmm hard regs across UNSPEC_VOLATILE. Jakub
Jakub Jelinek <jakub@redhat.com> writes: > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} >> > > Ok for trunk? >> > >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later >> > removed for some reason). Perhaps we could resurrect the patch for the >> > purpose of ferrying 128bit modes via vzeroupper RTX? >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html > > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html > is where it got removed, CCing Richard. Yeah. Initially clobber_high seemed like the best appraoch for handling the tlsdesc thing, but in practice it was too difficult to shoe-horn the concept in after the fact, when so much rtl infrastructure wasn't prepared to deal with it. The old support didn't handle all cases and passes correctly, and handled others suboptimally. I think it would be worth using the same approach as https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for vzeroupper: represent the instructions as call_insns in which the call has a special vzeroupper ABI. I think that's likely to lead to better code than clobber_high would (or at least, it did for tlsdesc). Thanks, Richard
On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote: > Jakub Jelinek <jakub@redhat.com> writes: > > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: > >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} > >> > > Ok for trunk? > >> > > >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later > >> > removed for some reason). Perhaps we could resurrect the patch for the > >> > purpose of ferrying 128bit modes via vzeroupper RTX? > >> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html > > > > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html > > is where it got removed, CCing Richard. > > Yeah. Initially clobber_high seemed like the best appraoch for > handling the tlsdesc thing, but in practice it was too difficult > to shoe-horn the concept in after the fact, when so much rtl > infrastructure wasn't prepared to deal with it. The old support > didn't handle all cases and passes correctly, and handled others > suboptimally. > > I think it would be worth using the same approach as > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for > vzeroupper: represent the instructions as call_insns in which the > call has a special vzeroupper ABI. I think that's likely to lead > to better code than clobber_high would (or at least, it did for tlsdesc). Perhaps a magic call_insn that is split post-reload into a normal insn with the sets then? Jakub
Jakub Jelinek <jakub@redhat.com> writes: > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote: >> Jakub Jelinek <jakub@redhat.com> writes: >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: >> >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} >> >> > > Ok for trunk? >> >> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later >> >> > removed for some reason). Perhaps we could resurrect the patch for the >> >> > purpose of ferrying 128bit modes via vzeroupper RTX? >> >> >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html >> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html >> > is where it got removed, CCing Richard. >> >> Yeah. Initially clobber_high seemed like the best appraoch for >> handling the tlsdesc thing, but in practice it was too difficult >> to shoe-horn the concept in after the fact, when so much rtl >> infrastructure wasn't prepared to deal with it. The old support >> didn't handle all cases and passes correctly, and handled others >> suboptimally. >> >> I think it would be worth using the same approach as >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for >> vzeroupper: represent the instructions as call_insns in which the >> call has a special vzeroupper ABI. I think that's likely to lead >> to better code than clobber_high would (or at least, it did for tlsdesc). > > Perhaps a magic call_insn that is split post-reload into a normal insn > with the sets then? I'd be tempted to treat it is a call_insn throughout. The unspec_volatile means that we can't move the instruction, so converting a call_insn to an insn isn't likely to help from that point of view. The sets are also likely to be handled suboptimally compared to the more accurate register information attached to the call: all code that handles calls has to be prepared to deal with partial clobbers, whereas most code dealing with sets will assume that the set does useful work, and that the rhs of the set is live. Thanks, Richard
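[To make the suggestion above concrete, here is a rough sketch of what the i386 side of a vzeroupper callee ABI could look like. The names ix86_insn_callee_abi, TARGET_INSN_CALLEE_ABI and ix86_hard_regno_call_part_clobbered appear in the v2 ChangeLog later in the thread; the bodies shown here, the ABI index 1 (matching the (unspec [(const_int 1)] UNSPEC_CALLEE_ABI) tag used later) and the use of function_abis[] are illustrative assumptions, not the committed code.]

/* Sketch only: route the vzeroupper insn to a dedicated callee ABI so
   that the register allocator treats it like a call that merely
   part-clobbers the SSE registers.  The vzeroupper ABI (index 1 here)
   would have to be registered up front, e.g. via
   predefined_function_abi::initialize with an empty full-clobber set.  */

static const predefined_function_abi &
ix86_insn_callee_abi (const rtx_insn *insn)
{
  unsigned int abi_id = 0;
  /* vzeroupper_pattern is the existing predicate from predicates.md.  */
  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
    abi_id = 1;
  return function_abis[abi_id];
}

#undef TARGET_INSN_CALLEE_ABI
#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi

/* The partial clobber itself is described here: for the vzeroupper
   ABI only values wider than 128 bits in SSE registers are clobbered,
   so the low halves stay live across the "call".  */

static bool
ix86_hard_regno_call_part_clobbered (unsigned int abi_id, unsigned int regno,
                                     machine_mode mode)
{
  if (abi_id == 1)
    return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
  /* ... existing handling for the real call ABIs ... */
  return false;
}

Because real calls keep ABI 0, the existing call-handling code stays conservatively correct; only the vzeroupper "call" advertises that the low 128 bits survive.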
On Thu, May 13, 2021 at 7:52 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Jakub Jelinek <jakub@redhat.com> writes: > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote: > >> Jakub Jelinek <jakub@redhat.com> writes: > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: > >> >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} > >> >> > > Ok for trunk? > >> >> > > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later > >> >> > removed for some reason). Perhaps we could resurrect the patch for the > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX? > >> >> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html > >> > > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html > >> > is where it got removed, CCing Richard. > >> > >> Yeah. Initially clobber_high seemed like the best appraoch for > >> handling the tlsdesc thing, but in practice it was too difficult > >> to shoe-horn the concept in after the fact, when so much rtl > >> infrastructure wasn't prepared to deal with it. The old support > >> didn't handle all cases and passes correctly, and handled others > >> suboptimally. > >> > >> I think it would be worth using the same approach as > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for > >> vzeroupper: represent the instructions as call_insns in which the > >> call has a special vzeroupper ABI. I think that's likely to lead > >> to better code than clobber_high would (or at least, it did for tlsdesc). From an implementation perspective, I guess you're meaning we should implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386 backend. > > > > Perhaps a magic call_insn that is split post-reload into a normal insn > > with the sets then? > > I'd be tempted to treat it is a call_insn throughout. The unspec_volatile > means that we can't move the instruction, so converting a call_insn to an > insn isn't likely to help from that point of view. The sets are also > likely to be handled suboptimally compared to the more accurate register > information attached to the call: all code that handles calls has to be > prepared to deal with partial clobbers, whereas most code dealing with > sets will assume that the set does useful work, and that the rhs of the > set is live. > > Thanks, > Richard >
On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Thu, May 13, 2021 at 7:52 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: > > > > Jakub Jelinek <jakub@redhat.com> writes: > > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote: > > >> Jakub Jelinek <jakub@redhat.com> writes: > > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: > > >> >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} > > >> >> > > Ok for trunk? > > >> >> > > > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later > > >> >> > removed for some reason). Perhaps we could resurrect the patch for the > > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX? > > >> >> > > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html > > >> > > > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html > > >> > is where it got removed, CCing Richard. > > >> > > >> Yeah. Initially clobber_high seemed like the best appraoch for > > >> handling the tlsdesc thing, but in practice it was too difficult > > >> to shoe-horn the concept in after the fact, when so much rtl > > >> infrastructure wasn't prepared to deal with it. The old support > > >> didn't handle all cases and passes correctly, and handled others > > >> suboptimally. > > >> > > >> I think it would be worth using the same approach as > > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for > > >> vzeroupper: represent the instructions as call_insns in which the > > >> call has a special vzeroupper ABI. I think that's likely to lead > > >> to better code than clobber_high would (or at least, it did for tlsdesc). > > From an implementation perspective, I guess you're meaning we should > implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386 > backend. > When I implemented the vzeroupper pattern as call_insn and defined TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related to 2 parts 1. requires_stack_frame_p return true for vzeroupper which should be false. 2. in subst_stack_regs, vzeroupper shouldn't kill arguments I've tried a rough patch like below, it works for those failures, unfortunately, I don't have an arm machine to test, so I want to ask would the below change break something in the arm backend? modified gcc/reg-stack.c @@ -174,6 +174,7 @@ #include "reload.h" #include "tree-pass.h" #include "rtl-iter.h" +#include "function-abi.h" #ifdef STACK_REGS @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack) bool control_flow_insn_deleted = false; int i; - if (CALL_P (insn)) + if (CALL_P (insn) && insn_callee_abi (insn).id () == 0) { int top = regstack->top; modified gcc/shrink-wrap.c @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used, unsigned regno; if (CALL_P (insn)) - return !SIBLING_CALL_P (insn); + { + if (insn_callee_abi (insn).id() != 0) + return false; + else + return !SIBLING_CALL_P (insn); + } /* We need a frame to get the unique CFA expected by the unwinder. */ if (cfun->can_throw_non_call_exceptions && can_throw_internal (insn)) > > > > > > Perhaps a magic call_insn that is split post-reload into a normal insn > > > with the sets then? > > > > I'd be tempted to treat it is a call_insn throughout. The unspec_volatile > > means that we can't move the instruction, so converting a call_insn to an > > insn isn't likely to help from that point of view. 
The sets are also > > likely to be handled suboptimally compared to the more accurate register > > information attached to the call: all code that handles calls has to be > > prepared to deal with partial clobbers, whereas most code dealing with > > sets will assume that the set does useful work, and that the rhs of the > > set is live. > > > > Thanks, > > Richard > > > > > -- > BR, > Hongtao
Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote: >> >> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford >> <richard.sandiford@arm.com> wrote: >> > >> > Jakub Jelinek <jakub@redhat.com> writes: >> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote: >> > >> Jakub Jelinek <jakub@redhat.com> writes: >> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: >> > >> >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} >> > >> >> > > Ok for trunk? >> > >> >> > >> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later >> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the >> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX? >> > >> >> >> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html >> > >> > >> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html >> > >> > is where it got removed, CCing Richard. >> > >> >> > >> Yeah. Initially clobber_high seemed like the best appraoch for >> > >> handling the tlsdesc thing, but in practice it was too difficult >> > >> to shoe-horn the concept in after the fact, when so much rtl >> > >> infrastructure wasn't prepared to deal with it. The old support >> > >> didn't handle all cases and passes correctly, and handled others >> > >> suboptimally. >> > >> >> > >> I think it would be worth using the same approach as >> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for >> > >> vzeroupper: represent the instructions as call_insns in which the >> > >> call has a special vzeroupper ABI. I think that's likely to lead >> > >> to better code than clobber_high would (or at least, it did for tlsdesc). >> >> From an implementation perspective, I guess you're meaning we should >> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386 >> backend. >> > When I implemented the vzeroupper pattern as call_insn and defined > TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related > to 2 parts > > 1. requires_stack_frame_p return true for vzeroupper which should be false. > 2. in subst_stack_regs, vzeroupper shouldn't kill arguments > > I've tried a rough patch like below, it works for those failures, > unfortunately, I don't have an arm machine to test, so I want to ask > would the below change break something in the arm backend? ABI id 0 just means the default ABI. Real calls can use other ABIs besides the default. That said… > modified gcc/reg-stack.c > @@ -174,6 +174,7 @@ > #include "reload.h" > #include "tree-pass.h" > #include "rtl-iter.h" > +#include "function-abi.h" > > #ifdef STACK_REGS > > @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack) > bool control_flow_insn_deleted = false; > int i; > > - if (CALL_P (insn)) > + if (CALL_P (insn) && insn_callee_abi (insn).id () == 0) > { > int top = regstack->top; …reg-stack.c is effectively x86-specific code, so checking id 0 here wouldn't affect anything else. It doesn't feel very future-proof though, since x86 could use ABIs other than 0 for real calls in future. AIUI the property that matters here isn't the ABI, but that the target of the call doesn't reference stack registers. That can be true for real calls too, with -fipa-ra. 
> modified gcc/shrink-wrap.c > @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, > HARD_REG_SET prologue_used, > unsigned regno; > > if (CALL_P (insn)) > - return !SIBLING_CALL_P (insn); > + { > + if (insn_callee_abi (insn).id() != 0) > + return false; > + else > + return !SIBLING_CALL_P (insn); > + } TBH I'm not sure why off-hand this function needs to treat non-sibling calls specially, rather than rely on normal DF information. Calls have a use of the stack pointer, so we should return true for that reason: /* The stack ptr is used (honorarily) by a CALL insn. */ df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i], NULL, bb, insn_info, DF_REF_REG_USE, DF_REF_CALL_STACK_USAGE | flags); I guess this is something we should suppress for fake calls though. It looks like the rtx “used” flag is unused for INSNs, so we could use that as a CALL_INSN flag that indicates a fake call. We could just need to make: /* For all other RTXes clear the used flag on the copy. */ RTX_FLAG (copy, used) = 0; conditional on !INSN_P. Thanks, Richard
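[As a concrete illustration of the point above -- that the interesting property in reg-stack.c is whether the callee can clobber any x87 stack register, not which ABI id the call carries -- a helper along the following lines could replace the bare CALL_P check in subst_stack_regs. The name callee_clobbers_any_stack_reg is taken from the ChangeLog of the final patch later in the thread; its body, and the use of insn_callee_abi and clobbers_at_least_part_of_reg_p, are guesses rather than the committed code.]

/* Sketch: return true if CALLEE_ABI can clobber any x87 stack register.
   A fake vzeroupper call clobbers none of them, so subst_stack_regs
   would not need to pop the argument registers for it; a real call
   (even one slimmed down by -fipa-ra) still answers true here if its
   ABI touches st(0)..st(7).  */

static bool
callee_clobbers_any_stack_reg (const function_abi &callee_abi)
{
  for (unsigned int regno = FIRST_STACK_REG; regno <= LAST_STACK_REG; regno++)
    if (callee_abi.clobbers_at_least_part_of_reg_p (regno))
      return true;
  return false;
}

subst_stack_regs could then guard its call handling with CALL_P (insn) && callee_clobbers_any_stack_reg (insn_callee_abi (insn)) instead of testing for ABI id 0.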
On Mon, May 17, 2021 at 5:56 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote: > >> > >> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford > >> <richard.sandiford@arm.com> wrote: > >> > > >> > Jakub Jelinek <jakub@redhat.com> writes: > >> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote: > >> > >> Jakub Jelinek <jakub@redhat.com> writes: > >> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote: > >> > >> >> > > Bootstrapped and regtested on X86_64-linux-gnu{-m32,} > >> > >> >> > > Ok for trunk? > >> > >> >> > > >> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later > >> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the > >> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX? > >> > >> >> > >> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html > >> > >> > > >> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html > >> > >> > is where it got removed, CCing Richard. > >> > >> > >> > >> Yeah. Initially clobber_high seemed like the best appraoch for > >> > >> handling the tlsdesc thing, but in practice it was too difficult > >> > >> to shoe-horn the concept in after the fact, when so much rtl > >> > >> infrastructure wasn't prepared to deal with it. The old support > >> > >> didn't handle all cases and passes correctly, and handled others > >> > >> suboptimally. > >> > >> > >> > >> I think it would be worth using the same approach as > >> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for > >> > >> vzeroupper: represent the instructions as call_insns in which the > >> > >> call has a special vzeroupper ABI. I think that's likely to lead > >> > >> to better code than clobber_high would (or at least, it did for tlsdesc). > >> > >> From an implementation perspective, I guess you're meaning we should > >> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386 > >> backend. > >> > > When I implemented the vzeroupper pattern as call_insn and defined > > TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related > > to 2 parts > > > > 1. requires_stack_frame_p return true for vzeroupper which should be false. > > 2. in subst_stack_regs, vzeroupper shouldn't kill arguments > > > > I've tried a rough patch like below, it works for those failures, > > unfortunately, I don't have an arm machine to test, so I want to ask > > would the below change break something in the arm backend? > > ABI id 0 just means the default ABI. Real calls can use other ABIs > besides the default. That said… > > > modified gcc/reg-stack.c > > @@ -174,6 +174,7 @@ > > #include "reload.h" > > #include "tree-pass.h" > > #include "rtl-iter.h" > > +#include "function-abi.h" > > > > #ifdef STACK_REGS > > > > @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack) > > bool control_flow_insn_deleted = false; > > int i; > > > > - if (CALL_P (insn)) > > + if (CALL_P (insn) && insn_callee_abi (insn).id () == 0) > > { > > int top = regstack->top; > > …reg-stack.c is effectively x86-specific code, so checking id 0 here > wouldn't affect anything else. It doesn't feel very future-proof > though, since x86 could use ABIs other than 0 for real calls in future. > > AIUI the property that matters here isn't the ABI, but that the target > of the call doesn't reference stack registers. 
That can be true for > real calls too, with -fipa-ra. > > > modified gcc/shrink-wrap.c > > @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, > > HARD_REG_SET prologue_used, > > unsigned regno; > > > > if (CALL_P (insn)) > > - return !SIBLING_CALL_P (insn); > > + { > > + if (insn_callee_abi (insn).id() != 0) > > + return false; > > + else > > + return !SIBLING_CALL_P (insn); > > + } > > TBH I'm not sure why off-hand this function needs to treat non-sibling > calls specially, rather than rely on normal DF information. Calls have > a use of the stack pointer, so we should return true for that reason: > > /* The stack ptr is used (honorarily) by a CALL insn. */ > df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i], > NULL, bb, insn_info, DF_REF_REG_USE, > DF_REF_CALL_STACK_USAGE | flags); > > I guess this is something we should suppress for fake calls though. > > It looks like the rtx “used” flag is unused for INSNs, so we could > use that as a CALL_INSN flag that indicates a fake call. We could just > need to make: > > /* For all other RTXes clear the used flag on the copy. */ > RTX_FLAG (copy, used) = 0; > > conditional on !INSN_P. > I got another error in @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn) return true; case CALL_INSN: + /* CALL_INSN use "used" flag to indicate it's a fake call. */ + if (RTX_FLAG (insn, used)) + break; and performance issue in modified gcc/final.c @@ -4498,7 +4498,8 @@ leaf_function_p (void) for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) { if (CALL_P (insn) - && ! SIBLING_CALL_P (insn)) + && ! SIBLING_CALL_P (insn) + && !RTX_FLAG (insn, used)) return 0; if (NONJUMP_INSN_P (insn) Also i grep CALL_P or CALL_INSN in GCC source codes, there are many places which hold the assumption CALL_P/CALL_INSN is a real call. Considering that vzeroupper is used a lot on the i386 backend, I'm a bit worried that this implementation solution will be a bottomless pit. > Thanks, > Richard
Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> It looks like the rtx “used” flag is unused for INSNs, so we could >> use that as a CALL_INSN flag that indicates a fake call. We could just >> need to make: >> >> /* For all other RTXes clear the used flag on the copy. */ >> RTX_FLAG (copy, used) = 0; >> >> conditional on !INSN_P. >> > I got another error in > > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn) > return true; > > case CALL_INSN: > + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > + if (RTX_FLAG (insn, used)) > + break; I guess this is because of the nonlocal_goto condition? If so, that could be fixed by adding a REG_EH_REGION note of INT_MIN. Even if we don't do that, I think the fix belongs in nonlocal_goto instead. > and performance issue in > > modified gcc/final.c > @@ -4498,7 +4498,8 @@ leaf_function_p (void) > for (insn = get_insns (); insn; insn = NEXT_INSN (insn)) > { > if (CALL_P (insn) > - && ! SIBLING_CALL_P (insn)) > + && ! SIBLING_CALL_P (insn) > + && !RTX_FLAG (insn, used)) > return 0; > if (NONJUMP_INSN_P (insn) > > Also i grep CALL_P or CALL_INSN in GCC source codes, there are many > places which hold the assumption CALL_P/CALL_INSN is a real call. > Considering that vzeroupper is used a lot on the i386 backend, I'm a > bit worried that this implementation solution will be a bottomless > pit. Maybe, but I think the same is true for CLOBBER_HIGH. If we have a third alternative then we should consider it, but I think the call approach is still going to be less problematic then CLOBBER_HIGH. The main advantage of the call approach is that the CALL_P handling is (mostly) conservatively correct and performance problems are just a one-line change. The CLOBBER_HIGH approach instead requires changes to the way that passes track liveness information for non-call instructions (so is much more than a one-line change). Also, treating a CLOBBER_HIGH like a CLOBBER isn't conservatively correct, because other code might be relying on part of the register being preserved. Thanks, Richard
On Tue, May 18, 2021 at 11:18 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford > > <richard.sandiford@arm.com> wrote: > >> It looks like the rtx “used” flag is unused for INSNs, so we could > >> use that as a CALL_INSN flag that indicates a fake call. We could just > >> need to make: > >> > >> /* For all other RTXes clear the used flag on the copy. */ > >> RTX_FLAG (copy, used) = 0; > >> > >> conditional on !INSN_P. > >> > > I got another error in > > > > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn) > > return true; > > > > case CALL_INSN: > > + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > > + if (RTX_FLAG (insn, used)) > > + break; > > I guess this is because of the nonlocal_goto condition? If so, that > could be fixed by adding a REG_EH_REGION note of INT_MIN. Even if we > don't do that, I think the fix belongs in nonlocal_goto instead. > This is error info, IMHO, the fix should be in control_flow_insn_p? ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c: In function ‘foo’: ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1: error: in basic block 5: 21 | } | ^ ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1: error: flow control insn inside a basic block (call_insn 77 50 86 5 (parallel [ (call (mem:QI (unspec_volatile [ (const_int 0 [0]) ] UNSPECV_VZEROUPPER) [0 S1 A8]) (const_int 0 [0])) (unspec [ (const_int 1 [0x1]) ] UNSPEC_CALLEE_ABI) ]) -1 (nil) (nil)) during RTL pass: pro_and_epilogue ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1: internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2797 0x129a2a3 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/rtl-error.c:108 0xcb8834 rtl_verify_bb_insns /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2797 0xcb8b09 rtl_verify_flow_info_1 /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2883 0xcb9284 rtl_verify_flow_info /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:3125 0xc9f44d verify_flow_info() /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.c:267 0xcb21b7 checking_verify_flow_info /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.h:212 0xcb6a3c commit_edge_insertions() /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2115 0xebfcb8 thread_prologue_and_epilogue_insns() /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6136 0xec07db rest_of_handle_thread_prologue_and_epilogue /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6510 0xec09b8 execute /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6586 > > Thanks, > Richard
On Tue, May 25, 2021 at 2:04 PM Hongtao Liu <crazylht@gmail.com> wrote: > > On Tue, May 18, 2021 at 11:18 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: > > > > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford > > > <richard.sandiford@arm.com> wrote: > > >> It looks like the rtx “used” flag is unused for INSNs, so we could > > >> use that as a CALL_INSN flag that indicates a fake call. We could just > > >> need to make: > > >> > > >> /* For all other RTXes clear the used flag on the copy. */ > > >> RTX_FLAG (copy, used) = 0; > > >> > > >> conditional on !INSN_P. > > >> > > > I got another error in > > > > > > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn) > > > return true; > > > > > > case CALL_INSN: > > > + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > > > + if (RTX_FLAG (insn, used)) > > > + break; > > > > I guess this is because of the nonlocal_goto condition? If so, that Oh, I guess you're meaning can_nonlocal_goto which is inside constrol_flow_insn_p. Sorry for disturbing you. > > could be fixed by adding a REG_EH_REGION note of INT_MIN. Even if we > > don't do that, I think the fix belongs in nonlocal_goto instead. > > > This is error info, IMHO, the fix should be in control_flow_insn_p? > > ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c: > In function ‘foo’: > ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1: > error: in basic block 5: > 21 | } > | ^ > ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1: > error: flow control insn inside a basic block > (call_insn 77 50 86 5 (parallel [ > (call (mem:QI (unspec_volatile [ > (const_int 0 [0]) > ] UNSPECV_VZEROUPPER) [0 S1 A8]) > (const_int 0 [0])) > (unspec [ > (const_int 1 [0x1]) > ] UNSPEC_CALLEE_ABI) > ]) -1 > (nil) > (nil)) > during RTL pass: pro_and_epilogue > ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1: > internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2797 > 0x129a2a3 _fatal_insn(char const*, rtx_def const*, char const*, int, > char const*) > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/rtl-error.c:108 > 0xcb8834 rtl_verify_bb_insns > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2797 > 0xcb8b09 rtl_verify_flow_info_1 > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2883 > 0xcb9284 rtl_verify_flow_info > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:3125 > 0xc9f44d verify_flow_info() > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.c:267 > 0xcb21b7 checking_verify_flow_info > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.h:212 > 0xcb6a3c commit_edge_insertions() > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2115 > 0xebfcb8 thread_prologue_and_epilogue_insns() > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6136 > 0xec07db rest_of_handle_thread_prologue_and_epilogue > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6510 > 0xec09b8 execute > /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6586 > > > > Thanks, > > Richard > > > > -- > BR, > Hongtao
Hi:
  This is an updated patch which implements vzeroupper as call_insn
which has a special vzeroupper ABI, also in this patch i reverted
r11-7684, r10-6451, r10-3677 which seems to fix the same issue but in
a different way.
  Bootstrapped and regtested on x86_64-linux-gnux{-m32,} and
x86_64-linux-gnux{-m32 \-march=cascadelake,-march=cascadelake}.
  Also test the patch on SPEC2017 and eembc, no performance impact as
expected.
  Ok for trunk?

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Remove
        assignment of cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-features.c
        (ix86_add_reg_usage_to_vzerouppers): Delete.
        (ix86_add_reg_usage_to_vzeroupper): Ditto.
        (rest_of_handle_insert_vzeroupper): Remove
        ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
        of the function.
        (gate): Remove cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
        Declared.
        * config/i386/i386.c (ix86_insn_callee_abi): New function.
        (ix86_initialize_callee_abi): Ditto.
        (ix86_expand_avx_vzeroupper): Ditto.
        (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
        ABI.
        (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
        * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
        (struct GTY(()) machine_function): Delete
        has_explicit_vzeroupper.
        * config/i386/i386.md (enum unspec): New member
        UNSPEC_CALLEE_ABI.
        * config/i386/predicates.md (vzeroupper_pattern): Adjust.
        * config/i386/sse.md (avx_vzeroupper): Call
        ix86_expand_avx_vzeroupper.
        (*avx_vzeroupper): Rename to ..
        (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
        call_insn which has a special vzeroupper ABI.
        (*avx_vzeroupper_1): Deleted.
        * df-scan.c (df_get_call_refs): When call_insn is a fake call,
        it won't use stack pointer reg.
        * final.c (leaf_function_p): When call_insn is a fake call, it
        won't affect caller as a leaf function.
        * reg-stack.c (callee_clobbers_any_stack_reg): New.
        (subst_stack_regs): When call_insn doesn't clobber any stack
        reg, don't clear the arguments.
        * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is
        a insn.
        * shrink-wrap.c (requires_stack_frame_p): No need for stack
        frame for a fake call.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.
On Thu, May 27, 2021 at 7:03 AM Hongtao Liu <crazylht@gmail.com> wrote: > > Hi: > This is an updated patch which implements vzeroupper as call_insn > which has a special vzeroupper ABI, also in this patch i reverted > r11-7684, r10-6451, r10-3677 which seems to fix the same issue but in > a different way. > Bootstrapped and regtested on x86_64-linux-gnux{-m32,} and > x86_64-linux-gnux{-m32 \-march=cascadelake,-march=cascadelake}. > Also test the patch on SPEC2017 and eembc, no performance impact as expected. > Ok for trunk? > > gcc/ChangeLog: > > PR target/82735 > * config/i386/i386-expand.c (ix86_expand_builtin): Remove > assignment of cfun->machine->has_explicit_vzeroupper. > * config/i386/i386-features.c > (ix86_add_reg_usage_to_vzerouppers): Delete. > (ix86_add_reg_usage_to_vzeroupper): Ditto. > (rest_of_handle_insert_vzeroupper): Remove > ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end > of the function. > (gate): Remove cfun->machine->has_explicit_vzeroupper. > * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper): > Declared. > * config/i386/i386.c (ix86_insn_callee_abi): New function. > (ix86_initialize_callee_abi): Ditto. > (ix86_expand_avx_vzeroupper): Ditto. > (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper > ABI. > (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi. > * config/i386/i386.h (enum i386_insn_callee_abi_index): New. > (struct GTY(()) machine_function): Delete > has_explicit_vzeroupper. > * config/i386/i386.md (enum unspec): New member > UNSPEC_CALLEE_ABI. > * config/i386/predicates.md (vzeroupper_pattern): Adjust. > * config/i386/sse.md (avx_vzeroupper): Call > ix86_expand_avx_vzeroupper. > (*avx_vzeroupper): Rename to .. > (avx_vzeroupper_callee_abi): .. this, and adjust pattern as > call_insn which has a special vzeroupper ABI. > (*avx_vzeroupper_1): Deleted. > * df-scan.c (df_get_call_refs): When call_insn is a fake call, > it won't use stack pointer reg. > * final.c (leaf_function_p): When call_insn is a fake call, it > won't affect caller as a leaf function. > * reg-stack.c (callee_clobbers_any_stack_reg): New. > (subst_stack_regs): When call_insn doesn't clobber any stack > reg, don't clear the arguments. > * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is > a insn. > * shrink-wrap.c (requires_stack_frame_p): No need for stack > frame for a fake call. > > gcc/testsuite/ChangeLog: > > PR target/82735 > * gcc.target/i386/pr82735-1.c: New test. > * gcc.target/i386/pr82735-2.c: New test. > * gcc.target/i386/pr82735-3.c: New test. > * gcc.target/i386/pr82735-4.c: New test. > * gcc.target/i386/pr82735-5.c: New test. Please split the patch to middle-end and target part. The middle-end should be approved first. (define_expand "avx_vzeroupper" - [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] - "TARGET_AVX") + [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)) + (const_int 0)) + (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])] The call insn doesn't look like a valid RTX. Why not just: + [(parallel [(call (mem:QI (const_int 0) + (const_int 0)) for a fake call? Also, UNSPEC_VZEROUPPER can be removed this way since the const_int 1 of UNSPEC_CALLEE_ABI is now used to detect vzeroupper. Also, you don't need the avx_vzeroupper pattern to just call ix86_expand_avx_vzeroupper. 
Just call the function directly from the call site: case AVX_U128: if (mode == AVX_U128_CLEAN) emit_insn (gen_avx_vzeroupper ()); break; + (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])] Can this const_int 1 be somehow more descriptive? Perhaps use define_constant to define I386_VZEROUPPER ABI and use it in .md as well as .c files. Uros.
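[For illustration, the expand helper Uros refers to could build the fake call along these lines. ix86_expand_avx_vzeroupper is the name used in the ChangeLog below, and the RTL shape follows the (call (mem:QI (const_int 0)) (const_int 0)) / UNSPEC_CALLEE_ABI pattern discussed above; everything else here is an assumption rather than the committed code.]

/* Sketch: emit an explicit vzeroupper as a "fake" call_insn that
   carries the dedicated callee ABI, then mark it so later passes can
   tell it apart from a real call.  */

void
ix86_expand_avx_vzeroupper (void)
{
  /* (call (mem:QI (const_int 0)) (const_int 0)) -- dummy callee.  */
  rtx callee = gen_rtx_MEM (QImode, const0_rtx);
  rtx call = gen_rtx_CALL (VOIDmode, callee, const0_rtx);
  /* Tag the call with the vzeroupper ABI index; a named
     define_constants value (I386_VZEROUPPER) would replace the bare
     const1_rtx.  */
  rtx abi = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, const1_rtx),
                            UNSPEC_CALLEE_ABI);
  rtx_insn *insn
    = emit_call_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, call, abi)));
  /* Mark it as a fake call so that passes don't treat it as a real
     one (the "used" flag discussed earlier, later wrapped in the
     FAKE_CALL_P macro).  */
  RTX_FLAG (insn, used) = 1;
}

Per the ChangeLog, ix86_emit_mode_set would then call this helper directly for AVX_U128_CLEAN instead of going through a separate expander.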
On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote: > + /* Flag used for call_insn indicates it's a fake call. */ > + RTX_FLAG (insn, used) = 1; > + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > + if (i == STACK_POINTER_REGNUM > + && !RTX_FLAG (insn_info->insn, used)) > - && ! SIBLING_CALL_P (insn)) > + && ! SIBLING_CALL_P (insn) > + && !RTX_FLAG (insn, used)) > - /* For all other RTXes clear the used flag on the copy. */ > - RTX_FLAG (copy, used) = 0; > + /* For all other RTXes clear the used flag on the copy. > + CALL_INSN use "used" flag to indicate it's a fake call. */ > + if (!INSN_P (orig)) > + RTX_FLAG (copy, used) = 0; > break; > } > return copy; > @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used, > HARD_REG_SET hardregs; > unsigned regno; > > - if (CALL_P (insn)) > + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > + if (CALL_P (insn) && !RTX_FLAG (insn, used)) > return !SIBLING_CALL_P (insn); Please define a macro for this in rtl.h (and mention it above used; member too in a comment, see all the other comments in there), like: /* 1 if RTX is a call_insn for a fake call. */ #define FAKE_CALL_P(RTX) \ (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used) Though, I'm also not sure if used can be actually used for this, because it is used e.g. in emit-rtl.c for verification of RTL sharing. Though, it seems no other rtl flag is free for CALL_INSN. Could this fake call flag sit on the CALL rtx instead? Jakub
Jakub Jelinek <jakub@redhat.com> writes: > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote: >> + /* Flag used for call_insn indicates it's a fake call. */ >> + RTX_FLAG (insn, used) = 1; > >> + /* CALL_INSN use "used" flag to indicate it's a fake call. */ >> + if (i == STACK_POINTER_REGNUM >> + && !RTX_FLAG (insn_info->insn, used)) > >> - && ! SIBLING_CALL_P (insn)) >> + && ! SIBLING_CALL_P (insn) >> + && !RTX_FLAG (insn, used)) > >> - /* For all other RTXes clear the used flag on the copy. */ >> - RTX_FLAG (copy, used) = 0; >> + /* For all other RTXes clear the used flag on the copy. >> + CALL_INSN use "used" flag to indicate it's a fake call. */ >> + if (!INSN_P (orig)) >> + RTX_FLAG (copy, used) = 0; >> break; >> } >> return copy; >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used, >> HARD_REG_SET hardregs; >> unsigned regno; >> >> - if (CALL_P (insn)) >> + /* CALL_INSN use "used" flag to indicate it's a fake call. */ >> + if (CALL_P (insn) && !RTX_FLAG (insn, used)) >> return !SIBLING_CALL_P (insn); > > Please define a macro for this in rtl.h (and mention it above used; > member too in a comment, see all the other comments in there), like: > /* 1 if RTX is a call_insn for a fake call. */ > #define FAKE_CALL_P(RTX) \ > (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used) > Though, I'm also not sure if used can be actually used for this, > because it is used e.g. in emit-rtl.c for verification of RTL sharing. I thought it should be OK, since: - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns - verify_rtx_sharing is only called for parts of an insn, rather than an insn itself I guess an alternative would be to add a new rtx_code for fake call insns and use CALL_P to test for both. However, that would lose the property that the default behaviour is conservatively correct (even for direct checks of CALL_INSN), so the flag IMO seems better. Thanks, Richard > Though, it seems no other rtl flag is free for CALL_INSN. > Could this fake call flag sit on the CALL rtx instead? > > Jakub
On Thu, May 27, 2021 at 6:50 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Jakub Jelinek <jakub@redhat.com> writes: > > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote: > >> + /* Flag used for call_insn indicates it's a fake call. */ > >> + RTX_FLAG (insn, used) = 1; > > > >> + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > >> + if (i == STACK_POINTER_REGNUM > >> + && !RTX_FLAG (insn_info->insn, used)) > > > >> - && ! SIBLING_CALL_P (insn)) > >> + && ! SIBLING_CALL_P (insn) > >> + && !RTX_FLAG (insn, used)) > > > >> - /* For all other RTXes clear the used flag on the copy. */ > >> - RTX_FLAG (copy, used) = 0; > >> + /* For all other RTXes clear the used flag on the copy. > >> + CALL_INSN use "used" flag to indicate it's a fake call. */ > >> + if (!INSN_P (orig)) > >> + RTX_FLAG (copy, used) = 0; > >> break; > >> } > >> return copy; > >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used, > >> HARD_REG_SET hardregs; > >> unsigned regno; > >> > >> - if (CALL_P (insn)) > >> + /* CALL_INSN use "used" flag to indicate it's a fake call. */ > >> + if (CALL_P (insn) && !RTX_FLAG (insn, used)) > >> return !SIBLING_CALL_P (insn); > > > > Please define a macro for this in rtl.h (and mention it above used; > > member too in a comment, see all the other comments in there), like: > > /* 1 if RTX is a call_insn for a fake call. */ > > #define FAKE_CALL_P(RTX) \ > > (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used) Changed. > > Though, I'm also not sure if used can be actually used for this, > > because it is used e.g. in emit-rtl.c for verification of RTL sharing. > > I thought it should be OK, since: > > - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns > - verify_rtx_sharing is only called for parts of an insn, rather than > an insn itself > > I guess an alternative would be to add a new rtx_code for fake call > insns and use CALL_P to test for both. However, that would lose the > property that the default behaviour is conservatively correct > (even for direct checks of CALL_INSN), so the flag IMO seems better. > > Thanks, > Richard > > > Though, it seems no other rtl flag is free for CALL_INSN. > > Could this fake call flag sit on the CALL rtx instead? > > > > Jakub Updated separate patch for the middle-end part.
On Thu, May 27, 2021 at 3:05 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Thu, May 27, 2021 at 7:03 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > Hi: > > This is an updated patch which implements vzeroupper as call_insn > > which has a special vzeroupper ABI, also in this patch i reverted > > r11-7684, r10-6451, r10-3677 which seems to fix the same issue but in > > a different way. > > Bootstrapped and regtested on x86_64-linux-gnux{-m32,} and > > x86_64-linux-gnux{-m32 \-march=cascadelake,-march=cascadelake}. > > Also test the patch on SPEC2017 and eembc, no performance impact as expected. > > Ok for trunk? > > > > gcc/ChangeLog: > > > > PR target/82735 > > * config/i386/i386-expand.c (ix86_expand_builtin): Remove > > assignment of cfun->machine->has_explicit_vzeroupper. > > * config/i386/i386-features.c > > (ix86_add_reg_usage_to_vzerouppers): Delete. > > (ix86_add_reg_usage_to_vzeroupper): Ditto. > > (rest_of_handle_insert_vzeroupper): Remove > > ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end > > of the function. > > (gate): Remove cfun->machine->has_explicit_vzeroupper. > > * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper): > > Declared. > > * config/i386/i386.c (ix86_insn_callee_abi): New function. > > (ix86_initialize_callee_abi): Ditto. > > (ix86_expand_avx_vzeroupper): Ditto. > > (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper > > ABI. > > (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi. > > * config/i386/i386.h (enum i386_insn_callee_abi_index): New. > > (struct GTY(()) machine_function): Delete > > has_explicit_vzeroupper. > > * config/i386/i386.md (enum unspec): New member > > UNSPEC_CALLEE_ABI. > > * config/i386/predicates.md (vzeroupper_pattern): Adjust. > > * config/i386/sse.md (avx_vzeroupper): Call > > ix86_expand_avx_vzeroupper. > > (*avx_vzeroupper): Rename to .. > > (avx_vzeroupper_callee_abi): .. this, and adjust pattern as > > call_insn which has a special vzeroupper ABI. > > (*avx_vzeroupper_1): Deleted. > > * df-scan.c (df_get_call_refs): When call_insn is a fake call, > > it won't use stack pointer reg. > > * final.c (leaf_function_p): When call_insn is a fake call, it > > won't affect caller as a leaf function. > > * reg-stack.c (callee_clobbers_any_stack_reg): New. > > (subst_stack_regs): When call_insn doesn't clobber any stack > > reg, don't clear the arguments. > > * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is > > a insn. > > * shrink-wrap.c (requires_stack_frame_p): No need for stack > > frame for a fake call. > > > > gcc/testsuite/ChangeLog: > > > > PR target/82735 > > * gcc.target/i386/pr82735-1.c: New test. > > * gcc.target/i386/pr82735-2.c: New test. > > * gcc.target/i386/pr82735-3.c: New test. > > * gcc.target/i386/pr82735-4.c: New test. > > * gcc.target/i386/pr82735-5.c: New test. > > Please split the patch to middle-end and target part. The middle-end > should be approved first. > > (define_expand "avx_vzeroupper" > - [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] > - "TARGET_AVX") > + [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)] > UNSPECV_VZEROUPPER)) > + (const_int 0)) > + (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])] > > The call insn doesn't look like a valid RTX. Why not just: > > + [(parallel [(call (mem:QI (const_int 0) > + (const_int 0)) > > for a fake call? Also, UNSPEC_VZEROUPPER can be removed this way since > the const_int 1 of UNSPEC_CALLEE_ABI is now used to detect vzeroupper. > Changed. 
> Also, you don't need the avx_vzeroupper pattern to just call > ix86_expand_avx_vzeroupper. Just call the function directly from the > call site: > > case AVX_U128: > if (mode == AVX_U128_CLEAN) > emit_insn (gen_avx_vzeroupper ()); > break; > Changed. > + (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])] > > Can this const_int 1 be somehow more descriptive? Perhaps use > define_constant to define I386_VZEROUPPER ABI and use it in .md as > well as .c files. Changed. > > Uros. Update separate patch for the backend part. gcc/ChangeLog: PR target/82735 * config/i386/i386-expand.c (ix86_expand_builtin): Remove assignment of cfun->machine->has_explicit_vzeroupper. * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers): Delete. (ix86_add_reg_usage_to_vzeroupper): Ditto. (rest_of_handle_insert_vzeroupper): Remove ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end of the function. (gate): Remove cfun->machine->has_explicit_vzeroupper. * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper): Declared. * config/i386/i386.c (ix86_insn_callee_abi): New function. (ix86_initialize_callee_abi): Ditto. (ix86_expand_avx_vzeroupper): Ditto. (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper ABI. (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi. (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper directly. * config/i386/i386.h (struct GTY(()) machine_function): Delete has_explicit_vzeroupper. * config/i386/i386.md (enum unspec): New member UNSPEC_CALLEE_ABI. (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New define_constants for insn callee abi index. * config/i386/predicates.md (vzeroupper_pattern): Adjust. * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted. (avx_vzeroupper): Call ix86_expand_avx_vzeroupper. (*avx_vzeroupper): Rename to .. (avx_vzeroupper_callee_abi): .. this, and adjust pattern as call_insn which has a special vzeroupper ABI. (*avx_vzeroupper_1): Deleted. gcc/testsuite/ChangeLog: PR target/82735 * gcc.target/i386/pr82735-1.c: New test. * gcc.target/i386/pr82735-2.c: New test. * gcc.target/i386/pr82735-3.c: New test. * gcc.target/i386/pr82735-4.c: New test. * gcc.target/i386/pr82735-5.c: New test.
On Tue, Jun 1, 2021 at 10:22 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 6:50 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Jakub Jelinek <jakub@redhat.com> writes:
> > > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> > >> +  /* Flag used for call_insn indicates it's a fake call. */
> > >> +  RTX_FLAG (insn, used) = 1;
> > >
> > >> +      /* CALL_INSN use "used" flag to indicate it's a fake call. */
> > >> +      if (i == STACK_POINTER_REGNUM
> > >> +          && !RTX_FLAG (insn_info->insn, used))
> > >
> > >> -      && ! SIBLING_CALL_P (insn))
> > >> +      && ! SIBLING_CALL_P (insn)
> > >> +      && !RTX_FLAG (insn, used))
> > >
> > >> -      /* For all other RTXes clear the used flag on the copy. */
> > >> -      RTX_FLAG (copy, used) = 0;
> > >> +      /* For all other RTXes clear the used flag on the copy.
> > >> +         CALL_INSN use "used" flag to indicate it's a fake call. */
> > >> +      if (!INSN_P (orig))
> > >> +        RTX_FLAG (copy, used) = 0;
> > >>       break;
> > >>     }
> > >>   return copy;
> > >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
> > >>   HARD_REG_SET hardregs;
> > >>   unsigned regno;
> > >>
> > >> -  if (CALL_P (insn))
> > >> +  /* CALL_INSN use "used" flag to indicate it's a fake call. */
> > >> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
> > >>     return !SIBLING_CALL_P (insn);
> > >
> > > Please define a macro for this in rtl.h (and mention it above used;
> > > member too in a comment, see all the other comments in there), like:
> > > /* 1 if RTX is a call_insn for a fake call. */
> > > #define FAKE_CALL_P(RTX) \
> > >   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> Changed.
> > > Though, I'm also not sure if used can be actually used for this,
> > > because it is used e.g. in emit-rtl.c for verification of RTL sharing.
> >
> > I thought it should be OK, since:
> >
> > - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
> > - verify_rtx_sharing is only called for parts of an insn, rather than
> >   an insn itself
> >
> > I guess an alternative would be to add a new rtx_code for fake call
> > insns and use CALL_P to test for both. However, that would lose the
> > property that the default behaviour is conservatively correct
> > (even for direct checks of CALL_INSN), so the flag IMO seems better.
> >
> > Thanks,
> > Richard
> >
> > > Though, it seems no other rtl flag is free for CALL_INSN.
> > > Could this fake call flag sit on the CALL rtx instead?
> > >
> > > Jakub
> >
Updated separate patch for the middle-end part.

gcc/ChangeLog:

        PR target/82735
        * df-scan.c (df_get_call_refs): When call_insn is a fake call,
        it won't use stack pointer reg.
        * final.c (leaf_function_p): When call_insn is a fake call, it
        won't affect caller as a leaf function.
        * reg-stack.c (callee_clobbers_any_stack_reg): New.
        (subst_stack_regs): When call_insn doesn't clobber any stack
        reg, don't clear the arguments.
        * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is
        an insn.
        * shrink-wrap.c (requires_stack_frame_p): No need for stack
        frame for a fake call.
        * rtl.h (FAKE_CALL_P): New macro.
>
> --
> BR,
> Hongtao
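For readers skimming the thread, the middle-end side boils down to the flag macro Jakub asked for plus a handful of guarded call sites.  The macro below is taken verbatim from the suggestion above; the two call sites are condensed from the quoted shrink-wrap.c and final.c hunks and are meant as illustration, not as exact context diffs.

/* rtl.h: 1 if RTX is a call_insn for a fake call.
   (The "used" bit is reused for this purpose on CALL_INSNs.)  */
#define FAKE_CALL_P(RTX) \
  (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)

/* shrink-wrap.c:requires_stack_frame_p -- a fake call by itself does
   not force a stack frame.  */
  if (CALL_P (insn) && !FAKE_CALL_P (insn))
    return !SIBLING_CALL_P (insn);

/* final.c:leaf_function_p -- a fake call does not stop the caller from
   being treated as a leaf function.  */
  if (CALL_P (insn)
      && !SIBLING_CALL_P (insn)
      && !FAKE_CALL_P (insn))
    return 0;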
From d53b0c6934ea499c9f87df963661b627e7e977bf Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 12 May 2021 14:20:54 +0800
Subject: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper
 will kill sse registers.

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERs or SETs before LRA,
which leads to incorrect optimization in pass_reload.  To solve this
problem, this patch introduces a pre_reload splitter which adds CLOBBERs
to vzeroupper's pattern; that fixes the problem reported in the PR.

At the same time, in order to let post_reload CSE optimize the low 128
bits, this patch also transforms those CLOBBERs to SETs in
pass_vzeroupper.

This works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved.  Even if there are no other uses of xmm6-xmm15 in the
function, pro_epilog will save and restore those registers because of
vzeroupper's pattern, which is obviously redundant.  To eliminate this
redundancy, a post_reload splitter is introduced which drops those SETs
until the epilogue_completed splitter adds them back; this looks to be
safe since there's no CSE between post_reload split2 and
epilogue_completed split3???  Also frame info needs to be updated in
pro_epilog, which saves and restores xmm6-xmm15 only if there's usage
other than the explicit vzeroupper pattern.

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Count
        number of __builtin_ia32_vzeroupper.
        * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
        Transform CLOBBERs to SETs for explicit vzeroupper pattern so
        that CSE can optimize lower 128 bits.
        * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
        New.
        (ix86_save_reg): If there's no use of xmm6~xmm15 other than
        explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
        REGNO.
        (ix86_finalize_stack_frame_flags): Recompute frame layout if
        there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
        * config/i386/i386.h (struct machine_function): Change type of
        has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
        * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
        splitter which will drop all SETs for explicit vzeroupper
        patterns.
        (*avx_vzeroupper_1): Generate SET reg to reg instead of
        CLOBBER, and add pre-reload splitter after it.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.
--- gcc/config/i386/i386-expand.c | 2 +- gcc/config/i386/i386-features.c | 25 ++++++++++- gcc/config/i386/i386.c | 23 ++++++++++ gcc/config/i386/i386.h | 8 ++-- gcc/config/i386/sse.md | 48 +++++++++++++++++++- gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 ++++++++++++ gcc/testsuite/gcc.target/i386/pr82735-2.c | 21 +++++++++ gcc/testsuite/gcc.target/i386/pr82735-3.c | 5 +++ gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 ++++++++++++++++++++ gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++++++++++++ 10 files changed, 256 insertions(+), 7 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index fee4d07b7fd..7f3326a12b2 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -13233,7 +13233,7 @@ rdseed_step: return 0; case IX86_BUILTIN_VZEROUPPER: - cfun->machine->has_explicit_vzeroupper = true; + cfun->machine->has_explicit_vzeroupper++; break; default: diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index 77783a154b6..6b2179f16cb 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -1827,8 +1827,31 @@ ix86_add_reg_usage_to_vzerouppers (void) { if (!NONDEBUG_INSN_P (insn)) continue; + /* Transform CLOBBERs to SETs so that lower 128 bits of sse reisters + will be able to cross vzeroupper in post-reload CSE. */ if (vzeroupper_pattern (PATTERN (insn), VOIDmode)) - ix86_add_reg_usage_to_vzeroupper (insn, live_regs); + { + if (XVECEXP (XVECEXP (PATTERN (insn), 0, 0), 0, 0) == const1_rtx) + { + unsigned int nregs = TARGET_64BIT ? 16 : 8; + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = XVECEXP (PATTERN (insn), 0, 0); + for (unsigned int i = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + rtx reg = gen_rtx_REG (V2DImode, regno); + RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); + } + XVEC (PATTERN (insn), 0) = vec; + INSN_CODE (insn) = -1; + df_insn_rescan (insn); + } + else + { + gcc_assert (XVECLEN (PATTERN (insn), 0) == 1); + ix86_add_reg_usage_to_vzeroupper (insn, live_regs); + } + } df_simulate_one_insn_backwards (bb, insn, live_regs); } } diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 780da108a7c..4d4d7dbbc82 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -6170,6 +6170,17 @@ ix86_hard_regno_scratch_ok (unsigned int regno) && df_regs_ever_live_p (regno))); } +/* Return true if explicit usage of __builtin_ia32_vzeroupper + should be specially handled in pro_epilog. */ +static bool +ix86_handle_explicit_vzeroupper_in_pro_epilog () +{ + return (cfun->machine->has_explicit_vzeroupper + && TARGET_64BIT_MS_ABI + && !epilogue_completed + && reload_completed); +} + /* Return TRUE if we need to save REGNO. */ bool @@ -6244,6 +6255,16 @@ ix86_save_reg (unsigned int regno, bool maybe_eh_return, bool ignore_outlined) && !cfun->machine->no_drap_save_restore) return true; + /* If there's no use other than explicit vzeroupper + for xmm6~xmm15 under TARGET_64BIT_MS_ABI, + no need to save REGNO. 
*/ + if (ix86_handle_explicit_vzeroupper_in_pro_epilog () + && (IN_RANGE (regno, FIRST_SSE_REG + 6, LAST_SSE_REG) + || IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG))) + return df_regs_ever_live_p (regno) + ? df_hard_reg_used_count (regno) > cfun->machine->has_explicit_vzeroupper + : false; + return (df_regs_ever_live_p (regno) && !call_used_or_fixed_reg_p (regno) && (regno != HARD_FRAME_POINTER_REGNUM || !frame_pointer_needed)); @@ -8046,6 +8067,8 @@ ix86_finalize_stack_frame_flags (void) recompute_frame_layout_p = true; crtl->stack_realign_needed = stack_realign; crtl->stack_realign_finalized = true; + if (ix86_handle_explicit_vzeroupper_in_pro_epilog ()) + recompute_frame_layout_p = true; if (recompute_frame_layout_p) ix86_compute_frame_layout (); } diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 97d6f3863cb..c0855a936ac 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -2654,10 +2654,6 @@ struct GTY(()) machine_function { /* True if the function needs a stack frame. */ BOOL_BITFIELD stack_frame_required : 1; - /* True if __builtin_ia32_vzeroupper () has been expanded in current - function. */ - BOOL_BITFIELD has_explicit_vzeroupper : 1; - /* True if we should act silently, rather than raise an error for invalid calls. */ BOOL_BITFIELD silent_p : 1; @@ -2665,6 +2661,10 @@ struct GTY(()) machine_function { /* The largest alignment, in bytes, of stack slot actually used. */ unsigned int max_used_stack_alignment; + /* Number of __builtin_ia32_vzeroupper () which has been expanded in + current function. */ + unsigned int has_explicit_vzeroupper; + /* During prologue/epilogue generation, the current frame state. Otherwise, the frame state at the end of the prologue. */ struct machine_frame_state fs; diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 897cf3eaea9..489fa02fa20 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -20626,7 +20626,7 @@ (define_insn_and_split "*avx_vzeroupper_1" else { rtx reg = gen_rtx_REG (V2DImode, regno); - RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg); } } operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); @@ -20638,6 +20638,52 @@ (define_insn_and_split "*avx_vzeroupper_1" (set_attr "btver2_decode" "vector") (set_attr "mode" "OI")]) +(define_split + [(match_parallel 0 "vzeroupper_pattern" + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])] + "TARGET_AVX && ix86_pre_reload_split ()" + [(match_dup 0)] +{ + /* When vzeroupper is explictly used, for LRA purpose, make it clear + the instruction kills sse registers. */ + gcc_assert (cfun->machine->has_explicit_vzeroupper); + unsigned int nregs = TARGET_64BIT ? 16 : 8; + rtvec vec = rtvec_alloc (nregs + 1); + RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode, + gen_rtvec (1, const1_rtx), + UNSPECV_VZEROUPPER); + for (unsigned int i = 0; i < nregs; ++i) + { + unsigned int regno = GET_SSE_REGNO (i); + rtx reg = gen_rtx_REG (V2DImode, regno); + RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg); + } + operands[0] = gen_rtx_PARALLEL (VOIDmode, vec); +}) + +(define_insn_and_split "*avx_vzeroupper_2" + [(match_parallel 0 "vzeroupper_pattern" + [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)])] + "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 
16 : 8) + 1" + "vzeroupper" + "&& reload_completed && TARGET_64BIT_MS_ABI" + [(const_int 0)] +{ + /* To avoid redundant save and restore in pro_and_epilog, drop + those SETs/CLOBBERs which are added by pre-reload splitter + or pass_vzeroupper, it's safe since there's no CSE optimization + between post-reload split2 and epilogue-completed split3??? */ + gcc_assert (cfun->machine->has_explicit_vzeroupper); + emit_insn (gen_avx_vzeroupper ()); + DONE; +} + [(set_attr "type" "sse") + (set_attr "modrm" "0") + (set_attr "memory" "none") + (set_attr "prefix" "vex") + (set_attr "btver2_decode" "vector") + (set_attr "mode" "OI")]) + (define_mode_attr pbroadcast_evex_isa [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw") (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw") diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c new file mode 100644 index 00000000000..1a63b9ae9c9 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c @@ -0,0 +1,29 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" + +void +__attribute__ ((noipa)) +mtest(char *dest) +{ + __m256i ymm1 = _mm256_set1_epi8((char)0x1); + _mm256_storeu_si256((__m256i *)(dest + 32), ymm1); + _mm256_zeroupper(); + __m256i ymm2 = _mm256_set1_epi8((char)0x1); + _mm256_storeu_si256((__m256i *)dest, ymm2); +} + +void +avx_test () +{ + char buf[64]; + for (int i = 0; i != 64; i++) + buf[i] = 2; + mtest (buf); + + for (int i = 0; i < 32; ++i) + if (buf[i] != 1) + __builtin_abort (); +} diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c new file mode 100644 index 00000000000..48d0d6e983d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-options "-mavx -O2" } */ + +#include <immintrin.h> + +void test(char *dest) +{ + /* xmm1 can be propagated to xmm2 by CSE. */ + __m128i xmm1 = _mm_set1_epi8((char)0x1); + _mm_storeu_si128((__m128i *)(dest + 32), xmm1); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + __m128i xmm2 = _mm_set1_epi8((char)0x1); + _mm_storeu_si128((__m128i *)dest, xmm2); +} + +/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */ +/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c new file mode 100644 index 00000000000..e3f801e6924 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c @@ -0,0 +1,5 @@ +/* { dg-do compile } */ +/* { dg-options "-mavx -O2 -mabi=ms" } */ +/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */ + +#include "pr82735-2.c" diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c new file mode 100644 index 00000000000..78c0a6cb2c8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c @@ -0,0 +1,48 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */ +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */ +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */ + +#include <immintrin.h> + +void test(char *dest) +{ + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15; + asm volatile ("vmovdqa\t%%ymm0, %0\n\t" + "vmovdqa\t%%ymm0, %1\n\t" + "vmovdqa\t%%ymm0, %2\n\t" + "vmovdqa\t%%ymm0, %3\n\t" + "vmovdqa\t%%ymm0, %4\n\t" + "vmovdqa\t%%ymm0, %5\n\t" + "vmovdqa\t%%ymm0, %6\n\t" + "vmovdqa\t%%ymm0, %7\n\t" + "vmovdqa\t%%ymm0, %8\n\t" + "vmovdqa\t%%ymm0, %9\n\t" + "vmovdqa\t%%ymm0, %10\n\t" + "vmovdqa\t%%ymm0, %11\n\t" + "vmovdqa\t%%ymm0, %12\n\t" + "vmovdqa\t%%ymm0, %13\n\t" + "vmovdqa\t%%ymm0, %14\n\t" + "vmovdqa\t%%ymm0, %15\n\t" + : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5), + "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10), + "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15), + "=v"(ymm0) + ::); + _mm256_zeroupper(); + _mm256_storeu_si256((__m256i *)dest, ymm1); + _mm256_storeu_si256((__m256i *)(dest + 32), ymm2); + _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3); + _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4); + _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5); + _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6); + _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7); + _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8); + _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9); + _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10); + _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11); + _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12); + _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13); + _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14); + _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15); +} diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c new file mode 100644 index 00000000000..2a58cbe52d0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c @@ -0,0 +1,54 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */ +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */ +/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */ + +#include <immintrin.h> + +void test(char *dest) +{ + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15; + asm volatile ("vmovdqa\t%%ymm0, %0\n\t" + "vmovdqa\t%%ymm0, %1\n\t" + "vmovdqa\t%%ymm0, %2\n\t" + "vmovdqa\t%%ymm0, %3\n\t" + "vmovdqa\t%%ymm0, %4\n\t" + "vmovdqa\t%%ymm0, %5\n\t" + "vmovdqa\t%%ymm0, %6\n\t" + "vmovdqa\t%%ymm0, %7\n\t" + "vmovdqa\t%%ymm0, %8\n\t" + "vmovdqa\t%%ymm0, %9\n\t" + "vmovdqa\t%%ymm0, %10\n\t" + "vmovdqa\t%%ymm0, %11\n\t" + "vmovdqa\t%%ymm0, %12\n\t" + "vmovdqa\t%%ymm0, %13\n\t" + "vmovdqa\t%%ymm0, %14\n\t" + "vmovdqa\t%%ymm0, %15\n\t" + : "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5), + "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10), + "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15), + "=v"(ymm0) + ::); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_zeroupper(); + _mm256_storeu_si256((__m256i *)dest, ymm1); + _mm256_storeu_si256((__m256i *)(dest + 32), ymm2); + _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3); + _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4); + _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5); + _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6); + _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7); + _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8); + _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9); + _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10); + _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11); + _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12); + _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13); + _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14); + _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15); +} -- 2.18.1