Message ID: CAFULd4Z88+aey62UENVeSQCzCx+ev7-AYbCgW-ox63qa7R6TtA@mail.gmail.com
State: New
Series: [i386]: Improve STV conversion of shifts
On Wed, Aug 28, 2019 at 5:12 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Attached patch improves costing for STV shifts and corrects the reject
> condition for out-of-range shift count operands.
>
> 2019-08-28  Uroš Bizjak  <ubizjak@gmail.com>
>
>     * config/i386/i386-features.c
>     (general_scalar_chain::compute_convert_gain):
>     Correct cost for double-word shifts.
>     (general_scalar_to_vector_candidate_p): Reject count operands
>     greater or equal to mode bitsize.
>
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
>
> Committed to mainline SVN.

Ouch... I mixed up patches and actually committed the patch that removes
the maximum from the cost of sse<->int moves.

I can leave the patch in for a day, so we can see the effects of the cost
change, and if it creates problems, I'll revert it.

Sorry for the mixup,
Uros.
As usual with costing changes, the patch exposes a latent problem. The
patched compiler tries to generate a non-existent DImode move from a mask
register to an XMM register, and ICEs during reload [1]. The attached
patch tightens the secondary_reload_needed condition and fixes the issue.

I'm bootstrapping and regression testing the patch, and will post a
formal submission later today.

[1] https://gcc.gnu.org/ml/gcc-regression/2019-08/msg00537.html

Uros.

On Thu, Aug 29, 2019 at 9:53 AM Uros Bizjak <ubizjak@gmail.com> wrote:
[...]
And the patch...

On Thu, Aug 29, 2019 at 12:00 PM Uros Bizjak <ubizjak@gmail.com> wrote:
[...]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d2d84eb11663..1c9c719f22a3 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18306,32 +18306,36 @@ inline_secondary_memory_needed (machine_mode mode, reg_class_t class1,
   if (FLOAT_CLASS_P (class1) != FLOAT_CLASS_P (class2))
     return true;
 
-  /* Between mask and general, we have moves no larger than word size.  */
-  if ((MASK_CLASS_P (class1) != MASK_CLASS_P (class2))
-      && (GET_MODE_SIZE (mode) > UNITS_PER_WORD))
-    return true;
-
   /* ??? This is a lie.  We do have moves between mmx/general, and for
      mmx/sse2.  But by saying we need secondary memory we discourage the
      register allocator from using the mmx registers unless needed.  */
   if (MMX_CLASS_P (class1) != MMX_CLASS_P (class2))
     return true;
 
+  /* Between mask and general, we have moves no larger than word size.  */
+  if (MASK_CLASS_P (class1) != MASK_CLASS_P (class2))
+    {
+      if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2))
+	  || GET_MODE_SIZE (mode) > UNITS_PER_WORD)
+	return true;
+    }
+
   if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
     {
       /* SSE1 doesn't have any direct moves from other classes.  */
       if (!TARGET_SSE2)
	return true;
 
+      /* Between SSE and general, we have moves no larger than word size.  */
+      if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2))
+	  || GET_MODE_SIZE (mode) > UNITS_PER_WORD)
+	return true;
+
       /* If the target says that inter-unit moves are more expensive
	  than moving through memory, then don't generate them.  */
       if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
	   || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
	return true;
-
-      /* Between SSE and general, we have moves no larger than word size.  */
-      if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
-	return true;
     }
 
   return false;
@@ -18608,15 +18612,7 @@ ix86_register_move_cost (machine_mode mode, reg_class_t class1_i,
   if (MMX_CLASS_P (class1) != MMX_CLASS_P (class2))
     gcc_unreachable ();
 
-  /* Moves between SSE and integer units are expensive.  */
   if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
-
-    /* ??? By keeping returned value relatively high, we limit the number
-       of moves between integer and SSE registers for all targets.
-       Additionally, high value prevents problem with x86_modes_tieable_p(),
-       where integer modes in SSE registers are not tieable
-       because of missing QImode and HImode moves to, from or between
-       MMX/SSE registers.  */
     return (SSE_CLASS_P (class1)
	     ? ix86_cost->hard_register.sse_to_integer
	     : ix86_cost->hard_register.integer_to_sse);
On Thu, Aug 29, 2019 at 9:54 AM Uros Bizjak <ubizjak@gmail.com> wrote:
[...]
> Ouch... I mixed up patches and actually committed the patch that
> removes maximum from cost of sse<->int moves.
>
> I can leave the patch for a day, so we can see the effects of the cost
> change, and if the patch creates problems, I'll revert it.

Regresses gromacs and namd quite a bit on Haswell, also perl in SPEC 2000.
I guess we should try to understand why rather than reverting immediately
(I'd leave it in at least another few days to get more testers to pick up
the rev.).

Richard.

> Sorry for the mixup,
> Uros.
On Fri, Aug 30, 2019 at 9:22 AM Richard Biener <richard.guenther@gmail.com> wrote:
[...]
> Regresses gromacs and namd quite a bit on Haswell, also perl in SPEC 2000.
> I guess we should try understand why rather than reverting immediately
> (I'd leave it in at least another few days to get more testers pick up the
> rev.).

The correct patch is actually [1] (I have updated the subject):

2019-08-28  Uroš Bizjak  <ubizjak@gmail.com>

    * config/i386/i386.c (ix86_register_move_cost): Do not
    limit the cost of moves to/from XMM register to minimum 8.

There is no technical reason to limit the costs to a minimum of 8 anymore,
and several targets (e.g. skylake) also claim that moves between SSE and
general regs are as cheap as moves between general regs. However, these
values were never benchmarked properly due to the mentioned limitation,
and apparently cause some unwanted secondary effects (lowering the clock).

I agree with your proposal to leave the change in the tree some more time.
In the end, we could raise the costs for all targets to 8 to effectively
revert to the previous state.

[1] https://gcc.gnu.org/ml/gcc-patches/2019-08/msg02009.html

Uros.
On Fri, Aug 30, 2019 at 09:42:06AM +0200, Uros Bizjak wrote:
[...]
> The correct patch is actually [1] (I have updated the subject):
>
> 2019-08-28  Uroš Bizjak  <ubizjak@gmail.com>
>
>     * config/i386/i386.c (ix86_register_move_cost): Do not
>     limit the cost of moves to/from XMM register to minimum 8.
>
> There is no technical reason to limit the costs to minimum 8 anymore,
> and several targets (e.g skylake) also claim that moves between SSE
> and general regs are as cheap as moves between general regs. However,
> these values were never benchmarked properly due to the mentioned
> limitation and apparently cause some unwanted secondary effects
> (lowering the clock).
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2019-08/msg02009.html

Those SPEC regressions sound similar to what I saw when trying to give
proper costing to moves between general regs and vsx regs on Power9.
rs6000.c:rs6000_ira_change_pseudo_allocno_class is the hack I used to
work around bad register allocation decisions due to poor register
pressure calculation.
On August 31, 2019 2:51:51 AM GMT+02:00, Alan Modra <amodra@gmail.com> wrote:
> On Fri, Aug 30, 2019 at 09:42:06AM +0200, Uros Bizjak wrote:
[...]
>> There is no technical reason to limit the costs to minimum 8 anymore,
>> and several targets (e.g skylake) also claim that moves between SSE
>> and general regs are as cheap as moves between general regs.

That's likely the issue, since moves between regs in the same class can
usually be handled by register renaming, while between classes that is
not generally possible. So I'd give those a bigger cost. OTOH, moves
between same-class regs are likely overcounted.

[...]
On Sat, Aug 31, 2019 at 3:57 PM Richard Biener <richard.guenther@gmail.com> wrote:
[...]
> That's likely the issue since moves between regs in the same class can
> be usually dealt with in register renaming while between classes that
> is not generally possible. So I'd make those bigger cost. Otoh moves
> between same class regs are likely overcounted.

Alan, thanks for the pointer. I took a look at the extensive comments in
and above rs6000_ira_change_pseudo_allocno_class and also inside
rs6000_register_move_cost. According to the comment on rs6000_i_c_p_a_c:

--quote--
The register allocator chooses GEN_OR_VSX_REGS for the allocno class if
GENERAL_REGS and VSX_REGS cost is lower than the memory cost. This
happens a lot when TARGET_DIRECT_MOVE makes the register move cost
between GENERAL_REGS and VSX_REGS low.

It might seem reasonable to use a union class. After all, if usage of
vsr is low and gpr high, it might make sense to spill gpr to vsr rather
than memory.

However, in cases where register pressure of both is high, like the
cactus_adm spec test, allowing GEN_OR_VSX_REGS as the allocno class
results in bad decisions in the first scheduling pass. This is partly
due to an allocno of GEN_OR_VSX_REGS wrongly contributing to the
GENERAL_REGS pressure class, which gives too high a pressure for
GENERAL_REGS and too low for VSX_REGS. So, force a choice of the
subclass here.
--/quote--

we can draw a parallel with x86's SSE and integer registers, where both
classes can hold SImode and DImode values. The attached patch is a first
try to implement the idea of forcing a subclass (I must admit that the
patch is a bit of a shot in the dark...).

Also, the comment in rs6000_register_move_cost says:

  /* Keep the cost for direct moves above that for within
     a register class even if the actual processor cost is
     comparable.  We do this because a direct move insn
     can't be a nop, whereas with ideal register
     allocation a move within the same class might turn
     out to be a nop.  */

which is not the case with core_cost (and similar with skylake_cost):

  2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
					   in 32,64,128,256 and 512-bit */
  {6, 6, 6, 6, 12},			/* cost of storing SSE registers
					   in 32,64,128,256 and 512-bit */
  2, 2,					/* SSE->integer and integer->SSE moves */

We have the same cost of moving between integer registers (by default set
to 2), between SSE registers, and between the integer and SSE register
sets. I think that at least the cost of moves between regsets should be
substantially higher; rs6000 uses 3x the cost of intra-regset moves,
which would translate to the value of 6. The value should be low enough
to keep the cost below the value that forces a move through memory.

Changing the core register allocation cost of SSE <-> integer to:

--cut here--
Index: config/i386/x86-tune-costs.h
===================================================================
--- config/i386/x86-tune-costs.h	(revision 275281)
+++ config/i386/x86-tune-costs.h	(working copy)
@@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
 					   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
					   in 32,64,128,256 and 512-bit */
-  2, 2,					/* SSE->integer and integer->SSE moves */
+  6, 6,					/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
  },
--cut here--

still produces a direct move in gcc.target/i386/minmax-6.c.

I think that in addition to the attached patch, values between 2 and 6
should be considered in benchmarking. Unfortunately, without access to
the regressed SPEC tests, I can't analyse these changes by myself.

Uros.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 275198)
+++ config/i386/i386.c	(working copy)
@@ -18632,6 +18632,25 @@
   return 2;
 }
 
+/* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.  */
+
+static reg_class_t
+x86_ira_change_pseudo_allocno_class (int regno ATTRIBUTE_UNUSED,
+				     reg_class_t allocno_class,
+				     reg_class_t best_class)
+{
+  switch (allocno_class)
+    {
+    case INT_SSE_REGS:
+      return best_class;
+
+    default:
+      break;
+    }
+
+  return allocno_class;
+}
+
 /* Implement TARGET_HARD_REGNO_NREGS.  This is ordinarily the length in
    words of a value of mode MODE but can be less for certain modes in
    special long registers.
@@ -22746,6 +22765,9 @@
 #define TARGET_REGISTER_MOVE_COST ix86_register_move_cost
 #undef TARGET_MEMORY_MOVE_COST
 #define TARGET_MEMORY_MOVE_COST ix86_memory_move_cost
+#undef TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
+#define TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS \
+  x86_ira_change_pseudo_allocno_class
 #undef TARGET_RTX_COSTS
 #define TARGET_RTX_COSTS ix86_rtx_costs
 #undef TARGET_ADDRESS_COST
On Sun, Sep 01, 2019 at 09:48:49PM +0200, Uros Bizjak wrote:
> the first try to implement the idea of forcing a subclass (I must
> admit that the patch is a bit of a shot in the dark...).

Yes, keep in mind that rs6000_ira_change_pseudo_allocno_class is a hack,
and one that might only be useful with TARGET_COMPUTE_PRESSURE_CLASSES
(and possibly SCHED_PRESSURE_MODEL too).
> which is not the case with core_cost (and similar with skylake_cost):
[...]
> I think that in addition to attached patch, values between 2 and 6
> should be considered in benchmarking. Unfortunately, without access to
> regressed SPEC tests, I can't analyse these changes by myself.
>
> Uros.

Applying a similar change to skylake_cost, on a Skylake workstation we
got performance like:

---------------------------------------------------------------------
version                                      | 548_exchange_r score
gcc10_20180822                               | 10
apply remove_max8                            | 8.9
also apply increased integer_tofrom_sse cost | 9.69
---------------------------------------------------------------------

Still a 3% regression, which is related to _gfortran_mminloc0_4_i4 in
libgfortran.so.5.0.0.

I found suspicious code as below; does it affect?

------------------
modified gcc/config/i386/i386-features.c
@@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain ()
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
-  /* ??? What about integer to SSE?  */
+  /* ??? What about integer to SSE?  */???
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
------------------
On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote:
[...]
> Still 3% regression which is related to _gfortran_mminloc0_4_i4 in
> libgfortran.so.5.0.0.
>
> I found suspicious code as below, does it affect?

Hard to say without access to the test, but I'm glad that changing the
knob has a noticeable effect. I think that (as said by Alan) fine-tuning
of the register pressure calculation will be needed to push this forward.

Uros.
On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote:
[...]
> Still 3% regression which is related to _gfortran_mminloc0_4_i4 in
> libgfortran.so.5.0.0.
>
> I found suspicious code as below, does it affect?

This should be fixed after

2019-08-27  Richard Biener  <rguenther@suse.de>

	* config/i386/i386-features.h
	(general_scalar_chain::~general_scalar_chain): Add.
	(general_scalar_chain::insns_conv): New bitmap.
	(general_scalar_chain::n_sse_to_integer): New.
	(general_scalar_chain::n_integer_to_sse): Likewise.
	(general_scalar_chain::make_vector_copies): Adjust signature.
	* config/i386/i386-features.c
	(general_scalar_chain::general_scalar_chain): Outline,
	initialize new members.
	(general_scalar_chain::~general_scalar_chain): New.
	(general_scalar_chain::mark_dual_mode_def): Record insns
	we need to insert conversions at and count them.
	(general_scalar_chain::compute_convert_gain): Account for
	conversion instructions at chain boundary.
	(general_scalar_chain::make_vector_copies): Generate a single
	copy for a def by a specific insn.
	(general_scalar_chain::convert_registers): First populate
	defs_map, then make copies at out-of chain insns.

where the only ??? is that we have

  const int sse_to_integer;	/* cost of moving SSE register to integer.  */

but not integer_to_sse. In the hard_register sub-struct of
processor_cost we have both:

    const int sse_to_integer;	/* cost of moving SSE register to integer.  */
    const int integer_to_sse;	/* cost of moving integer register to SSE.  */

IMHO it is odd that we have mostly the same kind of costs two times.
And the compute_convert_gain function adds up apples and oranges.
On Mon, Sep 2, 2019 at 6:23 PM Richard Biener <richard.guenther@gmail.com> wrote: > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > We have the same cost of moving between integer registers (by default > > > set to 2), between SSE registers and between integer and SSE register > > > sets. I think that at least the cost of moves between regsets should > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > that would translate to the value of 6. The value should be low enough > > > to keep the cost below the value that forces move through the memory. > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > --cut here-- > > > Index: config/i386/x86-tune-costs.h > > > =================================================================== > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > +++ config/i386/x86-tune-costs.h (working copy) > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > - 2, 2, /* SSE->integer and > > > integer->SSE moves */ > > > + 6, 6, /* SSE->integer and > > > integer->SSE moves */ > > > /* End of register allocator costs. */ > > > }, > > > > > > --cut here-- > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > should be considered in benchmarking. 
Unfortunately, without access to > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > Uros. > > > > Apply similar change to skylake_cost, on skylake workstation we got > > performance like: > > --------------------------- > > version | > > 548_exchange_r score > > gcc10_20180822: | 10 > > apply remove_max8 | 8.9 > > also apply increase integer_tofrom_sse cost | 9.69 > > ----------------------------- > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > libgfortran.so.5.0.0. > > > > I found suspicious code as bellow, does it affect? > > This should be fixed after > > 2019-08-27 Richard Biener <rguenther@suse.de> > > * config/i386/i386-features.h > (general_scalar_chain::~general_scalar_chain): Add. > (general_scalar_chain::insns_conv): New bitmap. > (general_scalar_chain::n_sse_to_integer): New. > (general_scalar_chain::n_integer_to_sse): Likewise. > (general_scalar_chain::make_vector_copies): Adjust signature. > * config/i386/i386-features.c > (general_scalar_chain::general_scalar_chain): Outline, > initialize new members. > (general_scalar_chain::~general_scalar_chain): New. > (general_scalar_chain::mark_dual_mode_def): Record insns > we need to insert conversions at and count them. > (general_scalar_chain::compute_convert_gain): Account > for conversion instructions at chain boundary. > (general_scalar_chain::make_vector_copies): Generate a single > copy for a def by a specific insn. > (general_scalar_chain::convert_registers): First populate > defs_map, then make copies at out-of chain insns. > > where the only ??? is that we have > > const int sse_to_integer; /* cost of moving SSE register to integer. */ > > but not integer_to_sse. In the hard_register sub-struct of processor_cost Yes. > we have both: > > const int sse_to_integer; /* cost of moving SSE register to integer. */ > const int integer_to_sse; /* cost of moving integer register to SSE. 
*/ > > IMHO that we have mostly the same kind of costs two times is odd. They are used for different purposes (one for register allocation, one for rtx_cost). Changing the cost for register allocation shouldn't affect rtx_cost, which is used elsewhere. > And the compute_convert_gain function adds up apples and oranges. > > > ------------------ > > modified gcc/config/i386/i386-features.c > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > if (dump_file) > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > - /* ??? What about integer to SSE? */ > > + /* ??? What about integer to SSE? */??? > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > ------------------ > > -- > > BR, > > Hongtao -- BR, Hongtao
On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > We have the same cost of moving between integer registers (by default > > > set to 2), between SSE registers and between integer and SSE register > > > sets. I think that at least the cost of moves between regsets should > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > that would translate to the value of 6. The value should be low enough > > > to keep the cost below the value that forces move through the memory. > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > --cut here-- > > > Index: config/i386/x86-tune-costs.h > > > =================================================================== > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > +++ config/i386/x86-tune-costs.h (working copy) > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > - 2, 2, /* SSE->integer and > > > integer->SSE moves */ > > > + 6, 6, /* SSE->integer and > > > integer->SSE moves */ > > > /* End of register allocator costs. */ > > > }, > > > > > > --cut here-- > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > should be considered in benchmarking. 
Unfortunately, without access to > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > Uros. > > > > Apply similar change to skylake_cost, on skylake workstation we got > > performance like: > > --------------------------- > > version | > > 548_exchange_r score > > gcc10_20180822: | 10 > > apply remove_max8 | 8.9 > > also apply increase integer_tofrom_sse cost | 9.69 > > ----------------------------- > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > libgfortran.so.5.0.0. > > > > I found suspicious code as bellow, does it affect? > > Hard to say without access to the test, but I'm glad that changing the > knob has noticeable effect. I think that (as said by Alan) a fine-tune > of register pressure calculation will be needed to push this forward. > > Uros. > > > ------------------ > > modified gcc/config/i386/i386-features.c > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > if (dump_file) > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > - /* ??? What about integer to SSE? */ > > + /* ??? What about integer to SSE? */??? > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > ------------------ > > -- > > BR, > > Hongtao Note: Removing the limit on the cost would introduce lots of regressions in SPEC2017, as follows: -------------------------------- 531.deepsjeng_r -7.18% 548.exchange_r -6.70% 557.xz_r -6.74% 508.namd_r -2.81% 527.cam4_r -6.48% 544.nab_r -3.99% Tested on skylake server. ------------------------------------- How about changing the cost from 2 to 8 until we figure out a better number?
On Tue, Sep 3, 2019 at 9:57 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > > in 32,64,128,256 and 512-bit */ > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > in 32,64,128,256 and 512-bit */ > > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > > > We have the same cost of moving between integer registers (by default > > > > set to 2), between SSE registers and between integer and SSE register > > > > sets. I think that at least the cost of moves between regsets should > > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > > that would translate to the value of 6. The value should be low enough > > > > to keep the cost below the value that forces move through the memory. > > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > > > --cut here-- > > > > Index: config/i386/x86-tune-costs.h > > > > =================================================================== > > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > > +++ config/i386/x86-tune-costs.h (working copy) > > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > > in 32,64,128,256 and 512-bit */ > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > in 32,64,128,256 and 512-bit */ > > > > - 2, 2, /* SSE->integer and > > > > integer->SSE moves */ > > > > + 6, 6, /* SSE->integer and > > > > integer->SSE moves */ > > > > /* End of register allocator costs. 
*/ > > > > }, > > > > > > > > --cut here-- > > > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > > should be considered in benchmarking. Unfortunately, without access to > > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > > > Uros. > > > > > > Apply similar change to skylake_cost, on skylake workstation we got > > > performance like: > > > --------------------------- > > > version | > > > 548_exchange_r score > > > gcc10_20180822: | 10 > > > apply remove_max8 | 8.9 > > > also apply increase integer_tofrom_sse cost | 9.69 > > > ----------------------------- > > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > > libgfortran.so.5.0.0. > > > > > > I found suspicious code as bellow, does it affect? > > > > Hard to say without access to the test, but I'm glad that changing the > > knob has noticeable effect. I think that (as said by Alan) a fine-tune > > of register pressure calculation will be needed to push this forward. > > > > Uros. > > > > > ------------------ > > > modified gcc/config/i386/i386-features.c > > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > > if (dump_file) > > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > > > - /* ??? What about integer to SSE? */ > > > + /* ??? What about integer to SSE? */??? > > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > > ------------------ > > > -- > > > BR, > > > Hongtao > > Note: > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > -------------------------------- > 531.deepsjeng_r -7.18% > 548.exchange_r -6.70% > 557.xz_r -6.74% > 508.namd_r -2.81% > 527.cam4_r -6.48% > 544.nab_r -3.99% > > Tested on skylake server. 
> ------------------------------------- > How about changing cost from 2 to 8 until we figure out a better number. Certainly works for me. Note the STV code uses the "other" sse_to_integer number and the numbers in question here are those for the RA. There's a multitude of values used in the tables here, including some a lot larger. So the overall bumping to 8 certainly was the wrong thing to do and instead individual numbers should have been adjusted (didn't look at the history of that bumping). For example Pentium4 has quite high bases for move costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 while the opposite 12. So yes, we want to revert the patch by applying its effect to the individual cost tables so we can revisit this for the still interesting micro-architectures. Richard. > -- > BR, > Hongtao
On Tue, Sep 3, 2019 at 1:24 PM Richard Biener <richard.guenther@gmail.com> wrote: > > On Tue, Sep 3, 2019 at 9:57 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > > > in 32,64,128,256 and 512-bit */ > > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > > in 32,64,128,256 and 512-bit */ > > > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > > > > > We have the same cost of moving between integer registers (by default > > > > > set to 2), between SSE registers and between integer and SSE register > > > > > sets. I think that at least the cost of moves between regsets should > > > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > > > that would translate to the value of 6. The value should be low enough > > > > > to keep the cost below the value that forces move through the memory. > > > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > > > > > --cut here-- > > > > > Index: config/i386/x86-tune-costs.h > > > > > =================================================================== > > > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > > > +++ config/i386/x86-tune-costs.h (working copy) > > > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > > > in 32,64,128,256 and 512-bit */ > > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > > in 32,64,128,256 and 512-bit */ > > > > > - 2, 2, /* SSE->integer and > > > > > integer->SSE moves */ > > > > > + 6, 6, /* SSE->integer and > > > > > integer->SSE moves */ > > > > > /* End of register allocator costs. 
*/ > > > > > }, > > > > > > > > > > --cut here-- > > > > > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > > > should be considered in benchmarking. Unfortunately, without access to > > > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > > > > > Uros. > > > > > > > > Apply similar change to skylake_cost, on skylake workstation we got > > > > performance like: > > > > --------------------------- > > > > version | > > > > 548_exchange_r score > > > > gcc10_20180822: | 10 > > > > apply remove_max8 | 8.9 > > > > also apply increase integer_tofrom_sse cost | 9.69 > > > > ----------------------------- > > > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > > > libgfortran.so.5.0.0. > > > > > > > > I found suspicious code as bellow, does it affect? > > > > > > Hard to say without access to the test, but I'm glad that changing the > > > knob has noticeable effect. I think that (as said by Alan) a fine-tune > > > of register pressure calculation will be needed to push this forward. > > > > > > Uros. > > > > > > > ------------------ > > > > modified gcc/config/i386/i386-features.c > > > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > > > if (dump_file) > > > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > > > > > - /* ??? What about integer to SSE? */ > > > > + /* ??? What about integer to SSE? */??? 
> > > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > > > ------------------ > > > > -- > > > > BR, > > > > Hongtao > > > > Note: > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > -------------------------------- > > 531.deepsjeng_r -7.18% > > 548.exchange_r -6.70% > > 557.xz_r -6.74% > > 508.namd_r -2.81% > > 527.cam4_r -6.48% > > 544.nab_r -3.99% > > > > Tested on skylake server. > > ------------------------------------- > > How about changing cost from 2 to 8 until we figure out a better number. > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > number and the numbers in question here are those for the RA. There's > a multitude of values used in the tables here, including some a lot larger. > So the overall bumping to 8 certainly was the wrong thing to do and instead > individual numbers should have been adjusted (didn't look at the history > of that bumping). For reference: r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines PR target/32413 * config/i386/i386.c (ix86_register_move_cost): Rise the cost of moves between MMX/SSE registers to at least 8 units to prevent ICE caused by non-tieable SI/HI/QImodes in SSE registers. should probably have been "twice the cost of X" or something like that instead, where X is some reg-reg move cost. > For example Pentium4 has quite high bases for move > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > while the opposite 12. > > So yes, we want to revert the patch by applying its effect to the > individual cost tables so we can revisit this for the still interesting > micro-architectures. > > Richard. > > > -- > > BR, > > Hongtao
On Tue, Sep 3, 2019 at 1:33 PM Richard Biener <richard.guenther@gmail.com> wrote: > > > Note: > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > -------------------------------- > > > 531.deepsjeng_r -7.18% > > > 548.exchange_r -6.70% > > > 557.xz_r -6.74% > > > 508.namd_r -2.81% > > > 527.cam4_r -6.48% > > > 544.nab_r -3.99% > > > > > > Tested on skylake server. > > > ------------------------------------- > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > number and the numbers in question here are those for the RA. There's > > a multitude of values used in the tables here, including some a lot larger. > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > individual numbers should have been adjusted (didn't look at the history > > of that bumping). > > For reference: > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > PR target/32413 > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > moves between MMX/SSE registers to at least 8 units to prevent > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > should probably have been "twice the cost of X" or something like that > instead where X be some reg-reg move cost. Thanks for the reference. It looks like the patch fixed the issue in the wrong place; this should be solved in inline_secondary_memory_needed: /* Between SSE and general, we have moves no larger than word size. */ if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) || GET_MODE_SIZE (mode) > UNITS_PER_WORD) return true; as an alternative to implement QI and HImode moves as a SImode move between SSE and int<->SSE registers. We have ix86_secondary_memory_needed_mode that extends QI and HImode secondary memory to SImode, so this should solve PR32413. 
Other than that, what to do with the bizarre property of direct moves that benchmark far worse than indirect moves? I was expecting that keeping the cost of direct inter-regset moves just a bit below the cost of int<->mem<->xmm, but (much?) higher than intra-regset moves, would prevent unwanted wandering of values between register sets, while still generating the direct move when needed. While this almost fixes the runtime regression, it is not clear to me from Hongtao Liu's message whether Richard's 2019-08-27 patch fixes the remaining regression or not. Liu, can you please clarify? > > For example Pentium4 has quite high bases for move > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > > while the opposite 12. > > > > So yes, we want to revert the patch by applying its effect to the > > individual cost tables so we can revisit this for the still interesting > > micro-architectures. Uros.
On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener > <richard.guenther@gmail.com> wrote: > > > > > Note: > > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > > -------------------------------- > > > > 531.deepsjeng_r -7.18% > > > > 548.exchange_r -6.70% > > > > 557.xz_r -6.74% > > > > 508.namd_r -2.81% > > > > 527.cam4_r -6.48% > > > > 544.nab_r -3.99% > > > > > > > > Tested on skylake server. > > > > ------------------------------------- > > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > > number and the numbers in question here are those for the RA. There's > > > a multitude of values used in the tables here, including some a lot larger. > > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > > individual numbers should have been adjusted (didn't look at the history > > > of that bumping). > > > > For reference: > > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > > > PR target/32413 > > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > > moves between MMX/SSE registers to at least 8 units to prevent > > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > > > should probably have been "twice the cost of X" or something like that > > instead where X be some reg-reg move cost. > > Thanks for the reference. It looks that the patch fixes the issue in > the wrong place, this should be solved in > inline_secondary_memory_needed: > > /* Between SSE and general, we have moves no larger than word size. 
*/ > if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) > || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) > || GET_MODE_SIZE (mode) > UNITS_PER_WORD) > return true; > > as an alternative to implement QI and HImode moves as a SImode move > between SSE and int<->SSE registers. We have > ix86_secondary_memory_needed_mode that extends QI and HImode secondary > memory to SImode, so this should solve PR32413. > > Other than that, what to do with the bizzare property of direct moves > that benchmark far worse than indirect moves? I was expecting that > keeping the cost of direct inter-regset moves just a bit below the > cost of int<->mem<->xmm, but (much ?) higher than itra-regset moves > would prevent unwanted wandering of values between register sets, > while still generating the direct move when needed. While this almost I've not tested that yet, so I'll start a test of this patch (changing the cost from 2 to 6) together with Richard's change, and I'll keep you informed when the test finishes. > fixes the runtime regression, it is not clear to me from Hongtao Liu's > message if Richard's 2019-08-27 fixes the remaining regression or > not). Liu, can you please clarify? > -------------------------------- 531.deepsjeng_r -7.18% 548.exchange_r -6.70% 557.xz_r -6.74% 508.namd_r -2.81% 527.cam4_r -6.48% 544.nab_r -3.99% Tested on skylake server. ------------------------------------- Those regressions compare gcc10_20190830 to gcc10_20190824 and are mainly caused by removing the limit of 8. > > > For example Pentium4 has quite high bases for move > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > > > while the opposite 12. > > > > > > So yes, we want to revert the patch by applying its effect to the > > > individual cost tables so we can revisit this for the still interesting > > > micro-architectures. > > Uros.
On Wed, Sep 4, 2019 at 9:44 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener > > <richard.guenther@gmail.com> wrote: > > > > > > > Note: > > > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > > > -------------------------------- > > > > > 531.deepsjeng_r -7.18% > > > > > 548.exchange_r -6.70% > > > > > 557.xz_r -6.74% > > > > > 508.namd_r -2.81% > > > > > 527.cam4_r -6.48% > > > > > 544.nab_r -3.99% > > > > > > > > > > Tested on skylake server. > > > > > ------------------------------------- > > > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > > > number and the numbers in question here are those for the RA. There's > > > > a multitude of values used in the tables here, including some a lot larger. > > > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > > > individual numbers should have been adjusted (didn't look at the history > > > > of that bumping). > > > > > > For reference: > > > > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > > > > > PR target/32413 > > > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > > > moves between MMX/SSE registers to at least 8 units to prevent > > > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > > > > > should probably have been "twice the cost of X" or something like that > > > instead where X be some reg-reg move cost. > > > > Thanks for the reference. It looks that the patch fixes the issue in > > the wrong place, this should be solved in > > inline_secondary_memory_needed: > > > > /* Between SSE and general, we have moves no larger than word size. 
*/ > > if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) > > || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) > > || GET_MODE_SIZE (mode) > UNITS_PER_WORD) > > return true; > > > > as an alternative to implement QI and HImode moves as a SImode move > > between SSE and int<->SSE registers. We have > > ix86_secondary_memory_needed_mode that extends QI and HImode secondary > > memory to SImode, so this should solve PR32413. > > > > Other than that, what to do with the bizzare property of direct moves > > that benchmark far worse than indirect moves? I was expecting that > > keeping the cost of direct inter-regset moves just a bit below the > > cost of int<->mem<->xmm, but (much ?) higher than itra-regset moves > > would prevent unwanted wandering of values between register sets, > > while still generating the direct move when needed. While this almost > > I've not tested it yet. > So i'll start a test about this patch(change cost from 2-->6) with > Richard's change. > I'll keep you informed when finishing test. > > > fixes the runtime regression, it is not clear to me from Hongtao Liu's > > message if Richard's 2019-08-27 fixes the remaining regression or > > not). Liu, can you please clarify? > > > -------------------------------- > 531.deepsjeng_r -7.18% > 548.exchange_r -6.70% > 557.xz_r -6.74% > 508.namd_r -2.81% > 527.cam4_r -6.48% > 544.nab_r -3.99% > > Tested on skylake server. > ------------------------------------- > Those regressions are comparing gcc10_20190830 to gcc10_20190824 which > are mainly caused by removing limit of 8. > > > > > For example Pentium4 has quite high bases for move > > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > > > > while the opposite 12. > > > > > > > > So yes, we want to revert the patch by applying its effect to the > > > > individual cost tables so we can revisit this for the still interesting > > > > micro-architectures. > > > > Uros. 
> > > > -- > BR, > Hongtao Changing the cost from 2 to 6 gives: ------------- 531.deepsjeng_r 9.64% 548.exchange_r 10.24% 557.xz_r 7.99% 508.namd_r 1.08% 527.cam4_r 6.91% 544.nab_r 3.06% ------------ For 531, 548, 557 and 527 this is even better than the version before the regression; for 508 and 544, small regressions remain compared to the version before the regression.
On Thu, Sep 5, 2019 at 7:47 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Wed, Sep 4, 2019 at 9:44 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener > > > <richard.guenther@gmail.com> wrote: > > > > > > > > > Note: > > > > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > > > > -------------------------------- > > > > > > 531.deepsjeng_r -7.18% > > > > > > 548.exchange_r -6.70% > > > > > > 557.xz_r -6.74% > > > > > > 508.namd_r -2.81% > > > > > > 527.cam4_r -6.48% > > > > > > 544.nab_r -3.99% > > > > > > > > > > > > Tested on skylake server. > > > > > > ------------------------------------- > > > > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > > > > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > > > > number and the numbers in question here are those for the RA. There's > > > > > a multitude of values used in the tables here, including some a lot larger. > > > > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > > > > individual numbers should have been adjusted (didn't look at the history > > > > > of that bumping). > > > > > > > > For reference: > > > > > > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > > > > > > > PR target/32413 > > > > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > > > > moves between MMX/SSE registers to at least 8 units to prevent > > > > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > > > > > > > should probably have been "twice the cost of X" or something like that > > > > instead where X be some reg-reg move cost. > > > > > > Thanks for the reference. 
It looks that the patch fixes the issue in > > > the wrong place, this should be solved in > > > inline_secondary_memory_needed: > > > > > > /* Between SSE and general, we have moves no larger than word size. */ > > > if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) > > > || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) > > > || GET_MODE_SIZE (mode) > UNITS_PER_WORD) > > > return true; > > > > > > as an alternative to implement QI and HImode moves as a SImode move > > > between SSE and int<->SSE registers. We have > > > ix86_secondary_memory_needed_mode that extends QI and HImode secondary > > > memory to SImode, so this should solve PR32413. > > > > > > Other than that, what to do with the bizzare property of direct moves > > > that benchmark far worse than indirect moves? I was expecting that > > > keeping the cost of direct inter-regset moves just a bit below the > > > cost of int<->mem<->xmm, but (much ?) higher than itra-regset moves > > > would prevent unwanted wandering of values between register sets, > > > while still generating the direct move when needed. While this almost > > > > I've not tested it yet. > > So i'll start a test about this patch(change cost from 2-->6) with > > Richard's change. > > I'll keep you informed when finishing test. > > > > > fixes the runtime regression, it is not clear to me from Hongtao Liu's > > > message if Richard's 2019-08-27 fixes the remaining regression or > > > not). Liu, can you please clarify? > > > > > -------------------------------- > > 531.deepsjeng_r -7.18% > > 548.exchange_r -6.70% > > 557.xz_r -6.74% > > 508.namd_r -2.81% > > 527.cam4_r -6.48% > > 544.nab_r -3.99% > > > > Tested on skylake server. > > ------------------------------------- > > Those regressions are comparing gcc10_20190830 to gcc10_20190824 which > > are mainly caused by removing limit of 8. 
> >
> > > > > For example Pentium4 has quite high bases for move
> > > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20
> > > > > while the opposite 12.
> > > > >
> > > > > So yes, we want to revert the patch by applying its effect to the
> > > > > individual cost tables so we can revisit this for the still interesting
> > > > > micro-architectures.
> > >
> > > Uros.
> >
> > --
> > BR,
> > Hongtao
>
> Changing the cost from 2 -> 6 got
> -------------
> 531.deepsjeng_r  9.64%
> 548.exchange_r   10.24%
> 557.xz_r         7.99%
> 508.namd_r       1.08%
> 527.cam4_r       6.91%
> 544.nab_r        3.06%
> ------------
>
> For 531, 548, 557 and 527, even better compared to the version before the regression.
> For 508 and 544, still small regressions compared to the version before the regression.

Good, that brings us into the "noise" region.

Based on these results and other findings, I propose the following solution:

- The inter-regset move costs of architectures that were defined before
  r125951 remain the same.  These are: size, i386, i486, pentium,
  pentiumpro, geode, k6, athlon, k8, amdfam10, pentium4 and nocona.
- bdver, btver1 and btver2 have costs higher than 8, so they are not
  affected.
- lakemont, znver1, znver2, atom, slm, intel and generic have
  inter-regset costs above intra-regset costs and below or equal to
  memory load/store costs, so they should remain as they are.
  Additionally, intel and generic costs are regularly re-tuned.
- Only skylake and core costs remain problematic.

So, I propose to raise the XMM<->intreg costs of the skylake and core
architectures to 6 to solve the regression.  These can be fine-tuned
later; we are now able to change the cost for the RA independently of
the RTX costs.  Also, the RA cost can be asymmetrical.

Attached patch implements the proposal.  If there are no other
proposals or discussions, I plan to commit it on Friday.

Uros.
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 3381b8bf143c..00edece3eb68 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1610,7 +1610,7 @@ struct processor_costs skylake_cost = {
 					   in 32,64,128,256 and 512-bit */
   {8, 8, 8, 12, 24},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  2, 2,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
  },
@@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
 					   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  2, 2,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
  },
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 6eb1482a9f01..766ec355e07b 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -549,11 +549,18 @@ general_scalar_chain::compute_convert_gain ()
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
+	  if (m == 2)
+	    {
+	      if (INTVAL (XEXP (src, 1)) >= 32)
+		igain += ix86_cost->add;
+	      else
+		igain += ix86_cost->shift_const;
+	    }
+
+	  igain += ix86_cost->shift_const - ix86_cost->sse_op;
+
 	  if (CONST_INT_P (XEXP (src, 0)))
 	    igain -= vector_const_cost (XEXP (src, 0));
-	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
-	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -1325,7 +1332,7 @@ general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
     case ASHIFT:
     case LSHIFTRT:
       if (!CONST_INT_P (XEXP (src, 1))
-	  || !IN_RANGE (INTVAL (XEXP (src, 1)), 0, 63))
+	  || !IN_RANGE (INTVAL (XEXP (src, 1)), 0, GET_MODE_BITSIZE (mode)-1))
 	return false;
       break;