Message ID: CAFULd4Z88+aey62UENVeSQCzCx+ev7-AYbCgW-ox63qa7R6TtA@mail.gmail.com
State: New
Series: [i386]: Improve STV conversion of shifts
On Wed, Aug 28, 2019 at 5:12 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Attached patch improves costing for STV shifts and corrects the reject
> condition for out-of-range shift count operands.
>
> 2019-08-28  Uroš Bizjak  <ubizjak@gmail.com>
>
>     * config/i386/i386-features.c
>     (general_scalar_chain::compute_convert_gain):
>     Correct cost for double-word shifts.
>     (general_scalar_to_vector_candidate_p): Reject count operands
>     greater or equal to mode bitsize.
>
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
>
> Committed to mainline SVN.

Ouch... I mixed up patches and actually committed the patch that removes
the maximum from the cost of sse<->int moves.

I can leave the patch in for a day, so we can see the effects of the cost
change, and if it creates problems, I'll revert it.

Sorry for the mixup,
Uros.
As usual with costing changes, the patch exposes a latent problem. The
patched compiler tries to generate a non-existent DImode move from a mask
register to an XMM register, and ICEs during reload [1]. The attached
patch tightens the secondary_reload_needed condition and fixes the issue.

I'm bootstrapping and regression testing the patch, and will post a
formal submission later today.

[1] https://gcc.gnu.org/ml/gcc-regression/2019-08/msg00537.html

Uros.

On Thu, Aug 29, 2019 at 9:53 AM Uros Bizjak <ubizjak@gmail.com> wrote:
[...]
And the patch...

On Thu, Aug 29, 2019 at 12:00 PM Uros Bizjak <ubizjak@gmail.com> wrote:
[...]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d2d84eb11663..1c9c719f22a3 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18306,32 +18306,36 @@ inline_secondary_memory_needed (machine_mode mode, reg_class_t class1,
   if (FLOAT_CLASS_P (class1) != FLOAT_CLASS_P (class2))
     return true;
 
-  /* Between mask and general, we have moves no larger than word size.  */
-  if ((MASK_CLASS_P (class1) != MASK_CLASS_P (class2))
-      && (GET_MODE_SIZE (mode) > UNITS_PER_WORD))
-    return true;
-
   /* ??? This is a lie.  We do have moves between mmx/general, and for
      mmx/sse2.  But by saying we need secondary memory we discourage the
      register allocator from using the mmx registers unless needed.  */
   if (MMX_CLASS_P (class1) != MMX_CLASS_P (class2))
     return true;
 
+  /* Between mask and general, we have moves no larger than word size.  */
+  if (MASK_CLASS_P (class1) != MASK_CLASS_P (class2))
+    {
+      if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2))
+	  || GET_MODE_SIZE (mode) > UNITS_PER_WORD)
+	return true;
+    }
+
   if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
     {
       /* SSE1 doesn't have any direct moves from other classes.  */
       if (!TARGET_SSE2)
	return true;
 
+      /* Between SSE and general, we have moves no larger than word size.  */
+      if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2))
+	  || GET_MODE_SIZE (mode) > UNITS_PER_WORD)
+	return true;
+
       /* If the target says that inter-unit moves are more expensive
	  than moving through memory, then don't generate them.  */
       if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
	   || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
	return true;
-
-      /* Between SSE and general, we have moves no larger than word size.  */
-      if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
-	return true;
     }
 
   return false;
@@ -18608,15 +18612,7 @@ ix86_register_move_cost (machine_mode mode, reg_class_t class1_i,
   if (MMX_CLASS_P (class1) != MMX_CLASS_P (class2))
     gcc_unreachable ();
 
-  /* Moves between SSE and integer units are expensive.  */
   if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
-
-    /* ??? By keeping returned value relatively high, we limit the number
-       of moves between integer and SSE registers for all targets.
-       Additionally, high value prevents problem with x86_modes_tieable_p(),
-       where integer modes in SSE registers are not tieable
-       because of missing QImode and HImode moves to, from or between
-       MMX/SSE registers.  */
     return (SSE_CLASS_P (class1)
	     ? ix86_cost->hard_register.sse_to_integer
	     : ix86_cost->hard_register.integer_to_sse);
On Thu, Aug 29, 2019 at 9:54 AM Uros Bizjak <ubizjak@gmail.com> wrote:
[...]
> Ouch... I mixed up patches and actually committed the patch that
> removes maximum from cost of sse<->int moves.
>
> I can leave the patch for a day, so we can see the effects of the cost
> change, and if the patch creates problems, I'll revert it.

Regresses gromacs and namd quite a bit on Haswell, also perl in SPEC 2000.
I guess we should try to understand why rather than reverting immediately
(I'd leave it in at least another few days to get more testers to pick up
the rev.).

Richard.

> Sorry for the mixup,
> Uros.
On Fri, Aug 30, 2019 at 9:22 AM Richard Biener <richard.guenther@gmail.com> wrote:
[...]
> Regresses gromacs and namd quite a bit on Haswell, also perl in SPEC 2000.
> I guess we should try understand why rather than reverting immediately
> (I'd leave it in at least another few days to get more testers pick up the
> rev.).

The correct patch is actually [1] (I have updated the subject):

2019-08-28  Uroš Bizjak  <ubizjak@gmail.com>

    * config/i386/i386.c (ix86_register_move_cost): Do not
    limit the cost of moves to/from XMM register to minimum 8.

There is no technical reason to limit the costs to a minimum of 8 anymore,
and several targets (e.g. skylake) also claim that moves between SSE and
general regs are as cheap as moves between general regs. However, these
values were never benchmarked properly due to the mentioned limitation,
and apparently cause some unwanted secondary effects (lowering the clock).

I agree with your proposal to leave the change in the tree some more time.
In the end, we could raise the costs for all targets to 8 to effectively
revert to the previous state.

[1] https://gcc.gnu.org/ml/gcc-patches/2019-08/msg02009.html

Uros.
On Fri, Aug 30, 2019 at 09:42:06AM +0200, Uros Bizjak wrote:
[...]
> The correct patch is actually [1] (I have updated the subject):
>
> 2019-08-28  Uroš Bizjak  <ubizjak@gmail.com>
>
>     * config/i386/i386.c (ix86_register_move_cost): Do not
>     limit the cost of moves to/from XMM register to minimum 8.
>
> There is no technical reason to limit the costs to minimum 8 anymore,
> and several targets (e.g skylake) also claim that moves between SSE
> and general regs are as cheap as moves between general regs. However,
> these values were never benchmarked properly due to the mentioned
> limitation and apparently cause some unwanted secondary effects
> (lowering the clock).
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2019-08/msg02009.html

Those SPEC regressions sound similar to what I saw when trying to give
proper costing to moves between general regs and vsx regs on Power9.
rs6000.c:rs6000_ira_change_pseudo_allocno_class is the hack I used to
work around bad register allocation decisions due to poor register
pressure calculation.
On August 31, 2019 2:51:51 AM GMT+02:00, Alan Modra <amodra@gmail.com> wrote:
> On Fri, Aug 30, 2019 at 09:42:06AM +0200, Uros Bizjak wrote:
[...]
>> There is no technical reason to limit the costs to minimum 8 anymore,
>> and several targets (e.g skylake) also claim that moves between SSE
>> and general regs are as cheap as moves between general regs.

That's likely the issue, since moves between regs in the same class can
usually be handled by register renaming, while between classes that is
not generally possible. So I'd give those a bigger cost. OTOH, moves
between same-class regs are likely overcounted.

[...]
On Sat, Aug 31, 2019 at 3:57 PM Richard Biener <richard.guenther@gmail.com> wrote:
[...]
> That's likely the issue since moves between regs in the same class can
> be usually dealt with in register renaming while between classes that
> is not generally possible. So I'd make those bigger cost. Otoh moves
> between same class regs are likely overcounted.

Alan, thanks for the pointer. I took a look at the extensive comments in
and above rs6000_ira_change_pseudo_allocno_class and also inside
rs6000_register_move_cost. According to the comment on rs6000_i_c_p_a_c:

--quote--
The register allocator chooses GEN_OR_VSX_REGS for the allocno class if
GENERAL_REGS and VSX_REGS cost is lower than the memory cost. This
happens a lot when TARGET_DIRECT_MOVE makes the register move cost
between GENERAL_REGS and VSX_REGS low.

It might seem reasonable to use a union class. After all, if usage of
vsr is low and gpr high, it might make sense to spill gpr to vsr rather
than memory.

However, in cases where register pressure of both is high, like the
cactus_adm spec test, allowing GEN_OR_VSX_REGS as the allocno class
results in bad decisions in the first scheduling pass. This is partly
due to an allocno of GEN_OR_VSX_REGS wrongly contributing to the
GENERAL_REGS pressure class, which gives too high a pressure for
GENERAL_REGS and too low for VSX_REGS. So, force a choice of the
subclass here.
--/quote--

we can draw a parallel with x86's SSE and integer registers, where both
classes can hold SImode and DImode values. The attached patch is a first
try to implement the idea of forcing a subclass (I must admit that the
patch is a bit of a shot in the dark...).

Also, the comment in rs6000_register_move_cost says:

  /* Keep the cost for direct moves above that for within
     a register class even if the actual processor cost is
     comparable.  We do this because a direct move insn
     can't be a nop, whereas with ideal register
     allocation a move within the same class might turn
     out to be a nop.  */

which is not the case with core_cost (and similar with skylake_cost):

  2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
					   in 32,64,128,256 and 512-bit */
  {6, 6, 6, 6, 12},			/* cost of storing SSE registers
					   in 32,64,128,256 and 512-bit */
  2, 2,					/* SSE->integer and integer->SSE moves */

We have the same cost of moving between integer registers (by default set
to 2), between SSE registers, and between the integer and SSE register
sets. I think that at least the cost of moves between regsets should be
substantially higher; rs6000 uses 3x the cost of intra-regset moves,
which would translate to the value of 6. The value should be low enough
to keep the cost below the value that forces a move through memory.

Changing the core register allocation cost of SSE <-> integer to:

--cut here--
Index: config/i386/x86-tune-costs.h
===================================================================
--- config/i386/x86-tune-costs.h	(revision 275281)
+++ config/i386/x86-tune-costs.h	(working copy)
@@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
 					   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
					   in 32,64,128,256 and 512-bit */
-  2, 2,					/* SSE->integer and integer->SSE moves */
+  6, 6,					/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
  },
--cut here--

still produces a direct move in gcc.target/i386/minmax-6.c.

I think that in addition to the attached patch, values between 2 and 6
should be considered in benchmarking. Unfortunately, without access to
the regressed SPEC tests, I can't analyse these changes by myself.

Uros.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 275198)
+++ config/i386/i386.c	(working copy)
@@ -18632,6 +18632,25 @@
   return 2;
 }
 
+/* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.  */
+
+static reg_class_t
+x86_ira_change_pseudo_allocno_class (int regno ATTRIBUTE_UNUSED,
+				     reg_class_t allocno_class,
+				     reg_class_t best_class)
+{
+  switch (allocno_class)
+    {
+    case INT_SSE_REGS:
+      return best_class;
+
+    default:
+      break;
+    }
+
+  return allocno_class;
+}
+
 /* Implement TARGET_HARD_REGNO_NREGS.  This is ordinarily the length in
    words of a value of mode MODE but can be less for certain modes in
    special long registers.
@@ -22746,6 +22765,9 @@
 #define TARGET_REGISTER_MOVE_COST ix86_register_move_cost
 #undef TARGET_MEMORY_MOVE_COST
 #define TARGET_MEMORY_MOVE_COST ix86_memory_move_cost
+#undef TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
+#define TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS \
+  x86_ira_change_pseudo_allocno_class
 #undef TARGET_RTX_COSTS
 #define TARGET_RTX_COSTS ix86_rtx_costs
 #undef TARGET_ADDRESS_COST
On Sun, Sep 01, 2019 at 09:48:49PM +0200, Uros Bizjak wrote:
> the first try to implement the idea of forcing a subclass (I must
> admit that the patch is a bit of a shot in the dark...).

Yes, keep in mind that rs6000_ira_change_pseudo_allocno_class is a hack,
and one that might only be useful with TARGET_COMPUTE_PRESSURE_CLASSES
(and possibly SCHED_PRESSURE_MODEL too).
> which is not the case with core_cost (and similar with skylake_cost):
[...]
> I think that in addition to attached patch, values between 2 and 6
> should be considered in benchmarking. Unfortunately, without access to
> regressed SPEC tests, I can't analyse these changes by myself.
>
> Uros.

Applying a similar change to skylake_cost, on a Skylake workstation we
got performance like:

---------------------------------------------------------------------
version                                      | 548_exchange_r score
gcc10_20180822                               | 10
apply remove_max8                            | 8.9
also apply increased integer_tofrom_sse cost | 9.69
---------------------------------------------------------------------

Still a 3% regression, which is related to _gfortran_mminloc0_4_i4 in
libgfortran.so.5.0.0.

I found suspicious code as below; does it affect?

------------------
modified gcc/config/i386/i386-features.c
@@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain ()
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
-  /* ??? What about integer to SSE?  */
+  /* ??? What about integer to SSE?  */???
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
------------------
On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote:
[...]
> Still 3% regression which is related to _gfortran_mminloc0_4_i4 in
> libgfortran.so.5.0.0.
>
> I found suspicious code as below, does it affect?

Hard to say without access to the test, but I'm glad that changing the
knob has a noticeable effect. I think that (as said by Alan) fine-tuning
of the register pressure calculation will be needed to push this forward.

Uros.
On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote:
[...]
> Still 3% regression which is related to _gfortran_mminloc0_4_i4 in
> libgfortran.so.5.0.0.
>
> I found suspicious code as below, does it affect?

This should be fixed after

2019-08-27  Richard Biener  <rguenther@suse.de>

	* config/i386/i386-features.h
	(general_scalar_chain::~general_scalar_chain): Add.
	(general_scalar_chain::insns_conv): New bitmap.
	(general_scalar_chain::n_sse_to_integer): New.
	(general_scalar_chain::n_integer_to_sse): Likewise.
	(general_scalar_chain::make_vector_copies): Adjust signature.
	* config/i386/i386-features.c
	(general_scalar_chain::general_scalar_chain): Outline,
	initialize new members.
	(general_scalar_chain::~general_scalar_chain): New.
	(general_scalar_chain::mark_dual_mode_def): Record insns
	we need to insert conversions at and count them.
	(general_scalar_chain::compute_convert_gain): Account for
	conversion instructions at chain boundary.
	(general_scalar_chain::make_vector_copies): Generate a single
	copy for a def by a specific insn.
	(general_scalar_chain::convert_registers): First populate
	defs_map, then make copies at out-of chain insns.

where the only ??? is that we have

  const int sse_to_integer;	/* cost of moving SSE register to integer.  */

but not integer_to_sse. In the hard_register sub-struct of
processor_cost we have both:

    const int sse_to_integer;	/* cost of moving SSE register to integer.  */
    const int integer_to_sse;	/* cost of moving integer register to SSE.  */

IMHO it is odd that we have mostly the same kind of costs two times.
And the compute_convert_gain function adds up apples and oranges.
On Mon, Sep 2, 2019 at 6:23 PM Richard Biener <richard.guenther@gmail.com> wrote: > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > We have the same cost of moving between integer registers (by default > > > set to 2), between SSE registers and between integer and SSE register > > > sets. I think that at least the cost of moves between regsets should > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > that would translate to the value of 6. The value should be low enough > > > to keep the cost below the value that forces move through the memory. > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > --cut here-- > > > Index: config/i386/x86-tune-costs.h > > > =================================================================== > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > +++ config/i386/x86-tune-costs.h (working copy) > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > - 2, 2, /* SSE->integer and > > > integer->SSE moves */ > > > + 6, 6, /* SSE->integer and > > > integer->SSE moves */ > > > /* End of register allocator costs. */ > > > }, > > > > > > --cut here-- > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > should be considered in benchmarking. 
Unfortunately, without access to > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > Uros. > > > > Apply similar change to skylake_cost, on skylake workstation we got > > performance like: > > --------------------------- > > version | > > 548_exchange_r score > > gcc10_20180822: | 10 > > apply remove_max8 | 8.9 > > also apply increase integer_tofrom_sse cost | 9.69 > > ----------------------------- > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > libgfortran.so.5.0.0. > > > > I found suspicious code as bellow, does it affect? > > This should be fixed after > > 2019-08-27 Richard Biener <rguenther@suse.de> > > * config/i386/i386-features.h > (general_scalar_chain::~general_scalar_chain): Add. > (general_scalar_chain::insns_conv): New bitmap. > (general_scalar_chain::n_sse_to_integer): New. > (general_scalar_chain::n_integer_to_sse): Likewise. > (general_scalar_chain::make_vector_copies): Adjust signature. > * config/i386/i386-features.c > (general_scalar_chain::general_scalar_chain): Outline, > initialize new members. > (general_scalar_chain::~general_scalar_chain): New. > (general_scalar_chain::mark_dual_mode_def): Record insns > we need to insert conversions at and count them. > (general_scalar_chain::compute_convert_gain): Account > for conversion instructions at chain boundary. > (general_scalar_chain::make_vector_copies): Generate a single > copy for a def by a specific insn. > (general_scalar_chain::convert_registers): First populate > defs_map, then make copies at out-of chain insns. > > where the only ??? is that we have > > const int sse_to_integer; /* cost of moving SSE register to integer. */ > > but not integer_to_sse. In the hard_register sub-struct of processor_cost Yes. > we have both: > > const int sse_to_integer; /* cost of moving SSE register to integer. */ > const int integer_to_sse; /* cost of moving integer register to SSE. 
*/ > > IMHO that we have mostly the same kind of costs two times is odd. They are used for different purposes (one for register allocation, one for rtx_cost). Changing the cost for register allocation shouldn't affect rtx_cost, which is used elsewhere. > And the compute_convert_gain function adds up apples and oranges. > > > ------------------ > > modified gcc/config/i386/i386-features.c > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > if (dump_file) > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > - /* ??? What about integer to SSE? */ > > + /* ??? What about integer to SSE? */??? > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > ------------------ > > -- > > BR, > > Hongtao -- BR, Hongtao
On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > We have the same cost of moving between integer registers (by default > > > set to 2), between SSE registers and between integer and SSE register > > > sets. I think that at least the cost of moves between regsets should > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > that would translate to the value of 6. The value should be low enough > > > to keep the cost below the value that forces move through the memory. > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > --cut here-- > > > Index: config/i386/x86-tune-costs.h > > > =================================================================== > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > +++ config/i386/x86-tune-costs.h (working copy) > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > in 32,64,128,256 and 512-bit */ > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > in 32,64,128,256 and 512-bit */ > > > - 2, 2, /* SSE->integer and > > > integer->SSE moves */ > > > + 6, 6, /* SSE->integer and > > > integer->SSE moves */ > > > /* End of register allocator costs. */ > > > }, > > > > > > --cut here-- > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > should be considered in benchmarking. 
Unfortunately, without access to > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > Uros. > > > > Apply similar change to skylake_cost, on skylake workstation we got > > performance like: > > --------------------------- > > version | > > 548_exchange_r score > > gcc10_20180822: | 10 > > apply remove_max8 | 8.9 > > also apply increase integer_tofrom_sse cost | 9.69 > > ----------------------------- > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > libgfortran.so.5.0.0. > > > > I found suspicious code as bellow, does it affect? > > Hard to say without access to the test, but I'm glad that changing the > knob has noticeable effect. I think that (as said by Alan) a fine-tune > of register pressure calculation will be needed to push this forward. > > Uros. > > > ------------------ > > modified gcc/config/i386/i386-features.c > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > if (dump_file) > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > - /* ??? What about integer to SSE? */ > > + /* ??? What about integer to SSE? */??? > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > ------------------ > > -- > > BR, > > Hongtao Note: Removing the limit on the cost would introduce lots of regressions in SPEC2017, as follows: -------------------------------- 531.deepsjeng_r -7.18% 548.exchange_r -6.70% 557.xz_r -6.74% 508.namd_r -2.81% 527.cam4_r -6.48% 544.nab_r -3.99% Tested on skylake server. ------------------------------------- How about changing the cost from 2 to 8 until we figure out a better number?
On Tue, Sep 3, 2019 at 9:57 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > > in 32,64,128,256 and 512-bit */ > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > in 32,64,128,256 and 512-bit */ > > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > > > We have the same cost of moving between integer registers (by default > > > > set to 2), between SSE registers and between integer and SSE register > > > > sets. I think that at least the cost of moves between regsets should > > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > > that would translate to the value of 6. The value should be low enough > > > > to keep the cost below the value that forces move through the memory. > > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > > > --cut here-- > > > > Index: config/i386/x86-tune-costs.h > > > > =================================================================== > > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > > +++ config/i386/x86-tune-costs.h (working copy) > > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > > in 32,64,128,256 and 512-bit */ > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > in 32,64,128,256 and 512-bit */ > > > > - 2, 2, /* SSE->integer and > > > > integer->SSE moves */ > > > > + 6, 6, /* SSE->integer and > > > > integer->SSE moves */ > > > > /* End of register allocator costs. 
*/ > > > > }, > > > > > > > > --cut here-- > > > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > > should be considered in benchmarking. Unfortunately, without access to > > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > > > Uros. > > > > > > Apply similar change to skylake_cost, on skylake workstation we got > > > performance like: > > > --------------------------- > > > version | > > > 548_exchange_r score > > > gcc10_20180822: | 10 > > > apply remove_max8 | 8.9 > > > also apply increase integer_tofrom_sse cost | 9.69 > > > ----------------------------- > > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > > libgfortran.so.5.0.0. > > > > > > I found suspicious code as bellow, does it affect? > > > > Hard to say without access to the test, but I'm glad that changing the > > knob has noticeable effect. I think that (as said by Alan) a fine-tune > > of register pressure calculation will be needed to push this forward. > > > > Uros. > > > > > ------------------ > > > modified gcc/config/i386/i386-features.c > > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > > if (dump_file) > > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > > > - /* ??? What about integer to SSE? */ > > > + /* ??? What about integer to SSE? */??? > > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > > ------------------ > > > -- > > > BR, > > > Hongtao > > Note: > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > -------------------------------- > 531.deepsjeng_r -7.18% > 548.exchange_r -6.70% > 557.xz_r -6.74% > 508.namd_r -2.81% > 527.cam4_r -6.48% > 544.nab_r -3.99% > > Tested on skylake server. 
> ------------------------------------- > How about changing cost from 2 to 8 until we figure out a better number. Certainly works for me. Note the STV code uses the "other" sse_to_integer number and the numbers in question here are those for the RA. There's a multitude of values used in the tables here, including some a lot larger. So the overall bumping to 8 certainly was the wrong thing to do and instead individual numbers should have been adjusted (didn't look at the history of that bumping). For example Pentium4 has quite high bases for move costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 while the opposite 12. So yes, we want to revert the patch by applying its effect to the individual cost tables so we can revisit this for the still interesting micro-architectures. Richard. > -- > BR, > Hongtao
On Tue, Sep 3, 2019 at 1:24 PM Richard Biener <richard.guenther@gmail.com> wrote: > > On Tue, Sep 3, 2019 at 9:57 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > > > > > > which is not the case with core_cost (and similar with skylake_cost): > > > > > > > > > > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > > > > > {6, 6, 6, 6, 12}, /* cost of loading SSE registers > > > > > in 32,64,128,256 and 512-bit */ > > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > > in 32,64,128,256 and 512-bit */ > > > > > 2, 2, /* SSE->integer and integer->SSE moves */ > > > > > > > > > > We have the same cost of moving between integer registers (by default > > > > > set to 2), between SSE registers and between integer and SSE register > > > > > sets. I think that at least the cost of moves between regsets should > > > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves; > > > > > that would translate to the value of 6. The value should be low enough > > > > > to keep the cost below the value that forces move through the memory. > > > > > Changing core register allocation cost of SSE <-> integer to: > > > > > > > > > > --cut here-- > > > > > Index: config/i386/x86-tune-costs.h > > > > > =================================================================== > > > > > --- config/i386/x86-tune-costs.h (revision 275281) > > > > > +++ config/i386/x86-tune-costs.h (working copy) > > > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = { > > > > > in 32,64,128,256 and 512-bit */ > > > > > {6, 6, 6, 6, 12}, /* cost of storing SSE registers > > > > > in 32,64,128,256 and 512-bit */ > > > > > - 2, 2, /* SSE->integer and > > > > > integer->SSE moves */ > > > > > + 6, 6, /* SSE->integer and > > > > > integer->SSE moves */ > > > > > /* End of register allocator costs. 
*/ > > > > > }, > > > > > > > > > > --cut here-- > > > > > > > > > > still produces direct move in gcc.target/i386/minmax-6.c > > > > > > > > > > I think that in addition to attached patch, values between 2 and 6 > > > > > should be considered in benchmarking. Unfortunately, without access to > > > > > regressed SPEC tests, I can't analyse these changes by myself. > > > > > > > > > > Uros. > > > > > > > > Apply similar change to skylake_cost, on skylake workstation we got > > > > performance like: > > > > --------------------------- > > > > version | > > > > 548_exchange_r score > > > > gcc10_20180822: | 10 > > > > apply remove_max8 | 8.9 > > > > also apply increase integer_tofrom_sse cost | 9.69 > > > > ----------------------------- > > > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in > > > > libgfortran.so.5.0.0. > > > > > > > > I found suspicious code as bellow, does it affect? > > > > > > Hard to say without access to the test, but I'm glad that changing the > > > knob has noticeable effect. I think that (as said by Alan) a fine-tune > > > of register pressure calculation will be needed to push this forward. > > > > > > Uros. > > > > > > > ------------------ > > > > modified gcc/config/i386/i386-features.c > > > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain () > > > > if (dump_file) > > > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > > > > > - /* ??? What about integer to SSE? */ > > > > + /* ??? What about integer to SSE? */??? 
> > > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > > > ------------------ > > > > -- > > > > BR, > > > > Hongtao > > > > Note: > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > -------------------------------- > > 531.deepsjeng_r -7.18% > > 548.exchange_r -6.70% > > 557.xz_r -6.74% > > 508.namd_r -2.81% > > 527.cam4_r -6.48% > > 544.nab_r -3.99% > > > > Tested on skylake server. > > ------------------------------------- > > How about changing cost from 2 to 8 until we figure out a better number. > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > number and the numbers in question here are those for the RA. There's > a multitude of values used in the tables here, including some a lot larger. > So the overall bumping to 8 certainly was the wrong thing to do and instead > individual numbers should have been adjusted (didn't look at the history > of that bumping). For reference: r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines PR target/32413 * config/i386/i386.c (ix86_register_move_cost): Rise the cost of moves between MMX/SSE registers to at least 8 units to prevent ICE caused by non-tieable SI/HI/QImodes in SSE registers. should probably have been "twice the cost of X" or something like that instead, where X is some reg-reg move cost. > For example Pentium4 has quite high bases for move > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > while the opposite 12. > > So yes, we want to revert the patch by applying its effect to the > individual cost tables so we can revisit this for the still interesting > micro-architectures. > > Richard. > > > -- > > BR, > > Hongtao
On Tue, Sep 3, 2019 at 1:33 PM Richard Biener <richard.guenther@gmail.com> wrote: > > > Note: > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > -------------------------------- > > > 531.deepsjeng_r -7.18% > > > 548.exchange_r -6.70% > > > 557.xz_r -6.74% > > > 508.namd_r -2.81% > > > 527.cam4_r -6.48% > > > 544.nab_r -3.99% > > > > > > Tested on skylake server. > > > ------------------------------------- > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > number and the numbers in question here are those for the RA. There's > > a multitude of values used in the tables here, including some a lot larger. > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > individual numbers should have been adjusted (didn't look at the history > > of that bumping). > > For reference: > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > PR target/32413 > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > moves between MMX/SSE registers to at least 8 units to prevent > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > should probably have been "twice the cost of X" or something like that > instead where X be some reg-reg move cost. Thanks for the reference. It looks like the patch fixed the issue in the wrong place; this should be solved in inline_secondary_memory_needed: /* Between SSE and general, we have moves no larger than word size. */ if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) || GET_MODE_SIZE (mode) > UNITS_PER_WORD) return true; as an alternative to implement QI and HImode moves as a SImode move between SSE and int<->SSE registers. We have ix86_secondary_memory_needed_mode that extends QI and HImode secondary memory to SImode, so this should solve PR32413. 
Other than that, what to do with the bizarre property of direct moves that benchmark far worse than indirect moves? I was expecting that keeping the cost of direct inter-regset moves just a bit below the cost of int<->mem<->xmm, but (much?) higher than intra-regset moves, would prevent unwanted wandering of values between register sets, while still generating the direct move when needed. While this almost fixes the runtime regression, it is not clear to me from Hongtao Liu's message whether Richard's 2019-08-27 patch fixes the remaining regression or not. Liu, can you please clarify? > > For example Pentium4 has quite high bases for move > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > > while the opposite 12. > > > > So yes, we want to revert the patch by applying its effect to the > > individual cost tables so we can revisit this for the still interesting > > micro-architectures. Uros.
On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener > <richard.guenther@gmail.com> wrote: > > > > > Note: > > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > > -------------------------------- > > > > 531.deepsjeng_r -7.18% > > > > 548.exchange_r -6.70% > > > > 557.xz_r -6.74% > > > > 508.namd_r -2.81% > > > > 527.cam4_r -6.48% > > > > 544.nab_r -3.99% > > > > > > > > Tested on skylake server. > > > > ------------------------------------- > > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > > number and the numbers in question here are those for the RA. There's > > > a multitude of values used in the tables here, including some a lot larger. > > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > > individual numbers should have been adjusted (didn't look at the history > > > of that bumping). > > > > For reference: > > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > > > PR target/32413 > > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > > moves between MMX/SSE registers to at least 8 units to prevent > > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > > > should probably have been "twice the cost of X" or something like that > > instead where X be some reg-reg move cost. > > Thanks for the reference. It looks that the patch fixes the issue in > the wrong place, this should be solved in > inline_secondary_memory_needed: > > /* Between SSE and general, we have moves no larger than word size. 
*/ > if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) > || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) > || GET_MODE_SIZE (mode) > UNITS_PER_WORD) > return true; > > as an alternative to implement QI and HImode moves as a SImode move > between SSE and int<->SSE registers. We have > ix86_secondary_memory_needed_mode that extends QI and HImode secondary > memory to SImode, so this should solve PR32413. > > Other than that, what to do with the bizzare property of direct moves > that benchmark far worse than indirect moves? I was expecting that > keeping the cost of direct inter-regset moves just a bit below the > cost of int<->mem<->xmm, but (much ?) higher than itra-regset moves > would prevent unwanted wandering of values between register sets, > while still generating the direct move when needed. While this almost I've not tested that yet, so I'll start a test of this patch (changing the cost from 2 to 6) together with Richard's change, and I'll keep you informed when the test finishes. > fixes the runtime regression, it is not clear to me from Hongtao Liu's > message if Richard's 2019-08-27 fixes the remaining regression or > not). Liu, can you please clarify? > -------------------------------- 531.deepsjeng_r -7.18% 548.exchange_r -6.70% 557.xz_r -6.74% 508.namd_r -2.81% 527.cam4_r -6.48% 544.nab_r -3.99% Tested on skylake server. ------------------------------------- Those regressions compare gcc10_20190830 to gcc10_20190824 and are mainly caused by removing the limit of 8. > > > For example Pentium4 has quite high bases for move > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > > > while the opposite 12. > > > > > > So yes, we want to revert the patch by applying its effect to the > > > individual cost tables so we can revisit this for the still interesting > > > micro-architectures. > > Uros.
On Wed, Sep 4, 2019 at 9:44 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener > > <richard.guenther@gmail.com> wrote: > > > > > > > Note: > > > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > > > -------------------------------- > > > > > 531.deepsjeng_r -7.18% > > > > > 548.exchange_r -6.70% > > > > > 557.xz_r -6.74% > > > > > 508.namd_r -2.81% > > > > > 527.cam4_r -6.48% > > > > > 544.nab_r -3.99% > > > > > > > > > > Tested on skylake server. > > > > > ------------------------------------- > > > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > > > number and the numbers in question here are those for the RA. There's > > > > a multitude of values used in the tables here, including some a lot larger. > > > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > > > individual numbers should have been adjusted (didn't look at the history > > > > of that bumping). > > > > > > For reference: > > > > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > > > > > PR target/32413 > > > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > > > moves between MMX/SSE registers to at least 8 units to prevent > > > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > > > > > should probably have been "twice the cost of X" or something like that > > > instead where X be some reg-reg move cost. > > > > Thanks for the reference. It looks that the patch fixes the issue in > > the wrong place, this should be solved in > > inline_secondary_memory_needed: > > > > /* Between SSE and general, we have moves no larger than word size. 
*/ > > if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) > > || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) > > || GET_MODE_SIZE (mode) > UNITS_PER_WORD) > > return true; > > > > as an alternative to implement QI and HImode moves as a SImode move > > between SSE and int<->SSE registers. We have > > ix86_secondary_memory_needed_mode that extends QI and HImode secondary > > memory to SImode, so this should solve PR32413. > > > > Other than that, what to do with the bizzare property of direct moves > > that benchmark far worse than indirect moves? I was expecting that > > keeping the cost of direct inter-regset moves just a bit below the > > cost of int<->mem<->xmm, but (much ?) higher than itra-regset moves > > would prevent unwanted wandering of values between register sets, > > while still generating the direct move when needed. While this almost > > I've not tested it yet. > So i'll start a test about this patch(change cost from 2-->6) with > Richard's change. > I'll keep you informed when finishing test. > > > fixes the runtime regression, it is not clear to me from Hongtao Liu's > > message if Richard's 2019-08-27 fixes the remaining regression or > > not). Liu, can you please clarify? > > > -------------------------------- > 531.deepsjeng_r -7.18% > 548.exchange_r -6.70% > 557.xz_r -6.74% > 508.namd_r -2.81% > 527.cam4_r -6.48% > 544.nab_r -3.99% > > Tested on skylake server. > ------------------------------------- > Those regressions are comparing gcc10_20190830 to gcc10_20190824 which > are mainly caused by removing limit of 8. > > > > > For example Pentium4 has quite high bases for move > > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20 > > > > while the opposite 12. > > > > > > > > So yes, we want to revert the patch by applying its effect to the > > > > individual cost tables so we can revisit this for the still interesting > > > > micro-architectures. > > > > Uros. 
> > > > -- > BR, > Hongtao Changing the cost from 2 to 6 gives: ------------- 531.deepsjeng_r 9.64% 548.exchange_r 10.24% 557.xz_r 7.99% 508.namd_r 1.08% 527.cam4_r 6.91% 544.nab_r 3.06% ------------ For 531, 548, 557 and 527 this is even better than the version before the regression; for 508 and 544, small regressions remain compared to the version before the regression.
On Thu, Sep 5, 2019 at 7:47 AM Hongtao Liu <crazylht@gmail.com> wrote: > > On Wed, Sep 4, 2019 at 9:44 AM Hongtao Liu <crazylht@gmail.com> wrote: > > > > On Wed, Sep 4, 2019 at 12:50 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > On Tue, Sep 3, 2019 at 1:33 PM Richard Biener > > > <richard.guenther@gmail.com> wrote: > > > > > > > > > Note: > > > > > > Removing limit of cost would introduce lots of regressions in SPEC2017 as follow > > > > > > -------------------------------- > > > > > > 531.deepsjeng_r -7.18% > > > > > > 548.exchange_r -6.70% > > > > > > 557.xz_r -6.74% > > > > > > 508.namd_r -2.81% > > > > > > 527.cam4_r -6.48% > > > > > > 544.nab_r -3.99% > > > > > > > > > > > > Tested on skylake server. > > > > > > ------------------------------------- > > > > > > How about changing cost from 2 to 8 until we figure out a better number. > > > > > > > > > > Certainly works for me. Note the STV code uses the "other" sse_to_integer > > > > > number and the numbers in question here are those for the RA. There's > > > > > a multitude of values used in the tables here, including some a lot larger. > > > > > So the overall bumping to 8 certainly was the wrong thing to do and instead > > > > > individual numbers should have been adjusted (didn't look at the history > > > > > of that bumping). > > > > > > > > For reference: > > > > > > > > r125951 | uros | 2007-06-22 19:51:06 +0200 (Fri, 22 Jun 2007) | 6 lines > > > > > > > > PR target/32413 > > > > * config/i386/i386.c (ix86_register_move_cost): Rise the cost of > > > > moves between MMX/SSE registers to at least 8 units to prevent > > > > ICE caused by non-tieable SI/HI/QImodes in SSE registers. > > > > > > > > should probably have been "twice the cost of X" or something like that > > > > instead where X be some reg-reg move cost. > > > > > > Thanks for the reference. 
It looks that the patch fixes the issue in > > > the wrong place, this should be solved in > > > inline_secondary_memory_needed: > > > > > > /* Between SSE and general, we have moves no larger than word size. */ > > > if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)) > > > || GET_MODE_SIZE (mode) < GET_MODE_SIZE (SImode) > > > || GET_MODE_SIZE (mode) > UNITS_PER_WORD) > > > return true; > > > > > > as an alternative to implement QI and HImode moves as a SImode move > > > between SSE and int<->SSE registers. We have > > > ix86_secondary_memory_needed_mode that extends QI and HImode secondary > > > memory to SImode, so this should solve PR32413. > > > > > > Other than that, what to do with the bizzare property of direct moves > > > that benchmark far worse than indirect moves? I was expecting that > > > keeping the cost of direct inter-regset moves just a bit below the > > > cost of int<->mem<->xmm, but (much ?) higher than itra-regset moves > > > would prevent unwanted wandering of values between register sets, > > > while still generating the direct move when needed. While this almost > > > > I've not tested it yet. > > So i'll start a test about this patch(change cost from 2-->6) with > > Richard's change. > > I'll keep you informed when finishing test. > > > > > fixes the runtime regression, it is not clear to me from Hongtao Liu's > > > message if Richard's 2019-08-27 fixes the remaining regression or > > > not). Liu, can you please clarify? > > > > > -------------------------------- > > 531.deepsjeng_r -7.18% > > 548.exchange_r -6.70% > > 557.xz_r -6.74% > > 508.namd_r -2.81% > > 527.cam4_r -6.48% > > 544.nab_r -3.99% > > > > Tested on skylake server. > > ------------------------------------- > > Those regressions are comparing gcc10_20190830 to gcc10_20190824 which > > are mainly caused by removing limit of 8. 
> >
> > > > > For example Pentium4 has quite high bases for move
> > > > > costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20
> > > > > while the opposite 12.
> > > > >
> > > > > So yes, we want to revert the patch by applying its effect to the
> > > > > individual cost tables so we can revisit this for the still interesting
> > > > > micro-architectures.
> > >
> > > Uros.
> >
> > --
> > BR,
> > Hongtao
>
> Changing the cost from 2 -> 6 got
> -------------
> 531.deepsjeng_r  9.64%
> 548.exchange_r   10.24%
> 557.xz_r         7.99%
> 508.namd_r       1.08%
> 527.cam4_r       6.91%
> 544.nab_r        3.06%
> ------------
>
> For 531, 548, 557 and 527, even better compared to the version before the regression.
> For 508 and 544, still small regressions compared to the version before the regression.

Good, that brings us into the "noise" region.

Based on these results and other findings, I propose the following solution:

- The inter-regset move costs of architectures that were defined before
  r125951 remain the same.  These are: size, i386, i486, pentium,
  pentiumpro, geode, k6, athlon, k8, amdfam10, pentium4 and nocona.
- bdver, btver1 and btver2 have costs higher than 8, so they are not
  affected.
- lakemont, znver1, znver2, atom, slm, intel and generic have
  inter-regset costs above intra-regset costs and below or equal to
  memory load/store costs, so they should remain as they are.
  Additionally, intel and generic costs are regularly re-tuned.
- Only skylake and core costs remain problematic.

So, I propose to raise the XMM<->intreg costs of the skylake and core
architectures to 6 to solve the regression.  These can be fine-tuned
later; we are now able to change the cost for the RA independently of
the RTX costs.  Also, the RA cost can be asymmetrical.

Attached patch implements the proposal.  If there are no other
proposals or discussions, I plan to commit it on Friday.

Uros.
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 3381b8bf143c..00edece3eb68 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1610,7 +1610,7 @@ struct processor_costs skylake_cost = {
 					   in 32,64,128,256 and 512-bit */
   {8, 8, 8, 12, 24},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  2, 2,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
  },
@@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
 					   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  2, 2,				/* SSE->integer and integer->SSE moves */
+  6, 6,				/* SSE->integer and integer->SSE moves */
   /* End of register allocator costs.  */
  },
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 6eb1482a9f01..766ec355e07b 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -549,11 +549,18 @@ general_scalar_chain::compute_convert_gain ()
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
+	  if (m == 2)
+	    {
+	      if (INTVAL (XEXP (src, 1)) >= 32)
+		igain += ix86_cost->add;
+	      else
+		igain += ix86_cost->shift_const;
+	    }
+
+	  igain += ix86_cost->shift_const - ix86_cost->sse_op;
+
 	  if (CONST_INT_P (XEXP (src, 0)))
 	    igain -= vector_const_cost (XEXP (src, 0));
-	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
-	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -1325,7 +1332,7 @@ general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
     case ASHIFT:
     case LSHIFTRT:
       if (!CONST_INT_P (XEXP (src, 1))
-	  || !IN_RANGE (INTVAL (XEXP (src, 1)), 0, 63))
+	  || !IN_RANGE (INTVAL (XEXP (src, 1)), 0, GET_MODE_BITSIZE (mode)-1))
 	return false;
       break;