[AArch64] Fix aarch64_ira_change_pseudo_allocno_class

Message ID	DB5PR08MB10304DAA91A309666E817B6483940@DB5PR08MB1030.eurprd08.prod.outlook.com
State	New
Headers	show Return-Path: <gcc-patches-return-478189-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:content-type :content-transfer-encoding:mime-version; q=dns; s=default; b=rQz FiUT91vqOZIUPknaLFQVQrjnW2YUdoqyEnbzi8qkxp10g1N0K1dhj57HpeQukTSC cP1hFM7pgA6ibw+82oFb7vdm4n1BkRu5/EGF8eVWyJQet4HbR32y0Zj2RrfawM/h H+TbfmvRGkDLvPilEGifQMn3Kl+zwbJM1WZ8L6rA= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org From: Wilco Dijkstra <Wilco.Dijkstra@arm.com> To: GCC Patches <gcc-patches@gcc.gnu.org>, James Greenhalgh <James.Greenhalgh@arm.com> CC: nd <nd@arm.com>, Richard Sandiford <Richard.Sandiford@arm.com> Subject: [PATCH][AArch64] Fix aarch64_ira_change_pseudo_allocno_class Date: Tue, 22 May 2018 16:08:16 +0000 Message-ID: <DB5PR08MB10304DAA91A309666E817B6483940@DB5PR08MB1030.eurprd08.prod.outlook.com> nodisclaimer: True received-spf: None (protection.outlook.com: arm.com does not designate permitted sender hosts) x-microsoft-antispam-message-info: hhVu13rZlZZlGazTUy20nc8AvrD01y9TNR2jYrr/RmBMRWEHucsmUFcOBtuKi+1CxkVUCqKmMSVNmsdrCbbVf10rnbV9kMIUZM1Iryao8TE904IuoFyZzkBhDcYZL5U3OcEwIB3p6iFCYJ9I7AejuBJccDMb1JzNsV6PW+0Eu8BSloRg8bQ4rpaCTw1l/2XJ spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-MS-Office365-Filtering-Correlation-Id: 2f5801f8-54c2-42e2-76ba-08d5bffe3a91
Series	[AArch64] Fix aarch64_ira_change_pseudo_allocno_class \| expand [AArch64] Fix aarch64_ira_change_pseudo_allocno_class

Wilco Dijkstra May 22, 2018, 4:08 p.m. UTC

A recent commit removing '*' from the md files caused a large regression in h264ref.
It turns out aarch64_ira_change_pseudo_allocno_class is no longer effective after the
SVE changes, and the combination results in the regression.  This patch fixes it by
using the new POINTER_AND_FP_REGS register class which is now used instead of ALL_REGS.
Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.

Passes regress, OK for commit?

Since it is a regression introduced in GCC8, OK to backport to GCC8?

ChangeLog:
2018-05-22  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/aarch64/aarch64.c (aarch64_ira_change_pseudo_allocno_class):
	Use POINTER_AND_FP_REGSinstead of ALL_REGS.
	* config/aarch64/aarch64-simd.md (aarch64_get_lane): Increase cost of r=w alternative.
--

Richard Sandiford May 22, 2018, 5:12 p.m. UTC | #1

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:

> A recent commit removing '*' from the md files caused a large regression
> in h264ref.
> It turns out aarch64_ira_change_pseudo_allocno_class is no longer
> effective after the
> SVE changes, and the combination results in the regression.  This patch
> fixes it by
> using the new POINTER_AND_FP_REGS register class which is now used
> instead of ALL_REGS.
> Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.
>
> Passes regress, OK for commit?
>
> Since it is a regression introduced in GCC8, OK to backport to GCC8?
>
> ChangeLog:
> 2018-05-22  Wilco Dijkstra  <wdijkstr@arm.com>
>
> 	* config/aarch64/aarch64.c (aarch64_ira_change_pseudo_allocno_class):
> 	Use POINTER_AND_FP_REGSinstead of ALL_REGS.
> 	* config/aarch64/aarch64-simd.md (aarch64_get_lane): Increase
> cost of r=w alternative.
> --
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 2ebd256329c1a6a6b790d16955cbcee3feca456c..3d5fe44b53198a92afb726712c6e9dee890afe38 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2961,7 +2961,7 @@ (define_insn "*aarch64_get_lane_zero_extendsi<mode>"
>  ;; is guaranteed so upper bits should be considered undefined.
>  ;; RTL uses GCC vector extension indices throughout so flip only for assembly.
>  (define_insn "aarch64_get_lane<mode>"
> -  [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=r, w, Utv")
> +  [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=?r, w, Utv")
>  	(vec_select:<VEL>
>  	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
>  	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 47d98dfd095cdcd15908a86091cf2f8a4d6137b1..a119760c7f332aded200fa1b5bcfb1bbac7b6420 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -1059,16 +1059,17 @@ aarch64_err_no_fpadvsimd (machine_mode mode, const char *msg)
>  }
>  
>  /* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.
> -   The register allocator chooses ALL_REGS if FP_REGS and GENERAL_REGS have
> -   the same cost even if ALL_REGS has a much larger cost.  ALL_REGS is also
> -   used if the cost of both FP_REGS and GENERAL_REGS is lower than the memory
> -   cost (in this case the best class is the lowest cost one).  Using ALL_REGS
> -   irrespectively of its cost results in bad allocations with many redundant
> -   int<->FP moves which are expensive on various cores.
> -   To avoid this we don't allow ALL_REGS as the allocno class, but force a
> -   decision between FP_REGS and GENERAL_REGS.  We use the allocno class if it
> -   isn't ALL_REGS.  Similarly, use the best class if it isn't ALL_REGS.
> -   Otherwise set the allocno class depending on the mode.
> +   The register allocator chooses POINTER_AND_FP_REGS if FP_REGS and
> +   GENERAL_REGS have the same cost - even if POINTER_AND_FP_REGS has a much
> +   higher cost.  POINTER_AND_FP_REGS is also used if the cost of both FP_REGS
> +   and GENERAL_REGS is lower than the memory cost (in this case the best class
> +   is the lowest cost one).  Using POINTER_AND_FP_REGS irrespectively of its
> +   cost results in bad allocations with many redundant int<->FP moves which
> +   are expensive on various cores.
> +   To avoid this we don't allow POINTER_AND_FP_REGS as the allocno class, but
> +   force a decision between FP_REGS and GENERAL_REGS.  We use the allocno class
> +   if it isn't POINTER_AND_FP_REGS.  Similarly, use the best class if it isn't
> +   POINTER_AND_FP_REGS.  Otherwise set the allocno class depending on the mode.
>     The result of this is that it is no longer inefficient to have a higher
>     memory move cost than the register move cost.
>  */
> @@ -1079,10 +1080,10 @@ aarch64_ira_change_pseudo_allocno_class (int regno, reg_class_t allocno_class,
>  {
>    machine_mode mode;
>  
> -  if (allocno_class != ALL_REGS)
> +  if (allocno_class != POINTER_AND_FP_REGS)
>      return allocno_class;
>  
> -  if (best_class != ALL_REGS)
> +  if (best_class != POINTER_AND_FP_REGS)
>      return best_class;
>  
>    mode = PSEUDO_REGNO_MODE (regno);

I think it'd be better to use !reg_class_subset_p (POINTER_AND_FP_REGS, ...)
instead of ... != POINTER_AND_FP_REGS, since this in principle still applies
to ALL_REGS too.

FWIW, the patch looks good to me with that change.

Thanks,
Richard

Wilco Dijkstra May 23, 2018, 11:21 a.m. UTC | #2

Richard Sandiford wrote:

> -  if (allocno_class != ALL_REGS)
> +  if (allocno_class != POINTER_AND_FP_REGS)
>      return allocno_class;
>  
> -  if (best_class != ALL_REGS)
> +  if (best_class != POINTER_AND_FP_REGS)
>      return best_class;
>  
>    mode = PSEUDO_REGNO_MODE (regno);

> I think it'd be better to use !reg_class_subset_p (POINTER_AND_FP_REGS, ...)
> instead of ... != POINTER_AND_FP_REGS, since this in principle still applies
> to ALL_REGS too.
> 
> FWIW, the patch looks good to me with that change.

How does reg_class_subset_p help? In my testing I didn't see ALL_REGS ever
used (and I don't believe it's possible to get it with SVE either). And it's not obvious
without looking at the implementation whether subset here means strict subset or not,
so it would obfuscate the clear meaning of the existing patch.

Wilco

Richard Sandiford May 23, 2018, 11:45 a.m. UTC | #3

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Richard Sandiford wrote:
>> -  if (allocno_class != ALL_REGS)
>> +  if (allocno_class != POINTER_AND_FP_REGS)
>>      return allocno_class;
>>  
>> -  if (best_class != ALL_REGS)
>> +  if (best_class != POINTER_AND_FP_REGS)
>>      return best_class;
>>  
>>    mode = PSEUDO_REGNO_MODE (regno);
>
>> I think it'd be better to use !reg_class_subset_p (POINTER_AND_FP_REGS, ...)
>> instead of ... != POINTER_AND_FP_REGS, since this in principle still applies
>> to ALL_REGS too.
>> 
>> FWIW, the patch looks good to me with that change.
>
> How does reg_class_subset_p help? In my testing I didn't see ALL_REGS ever
> used (and I don't believe it's possible to get it with SVE either). And
> it's not obvious
> without looking at the implementation whether subset here means strict
> subset or not,
> so it would obfuscate the clear meaning of the existing patch.

But I think the fact that we need this patch shows why hard-coding the
names of union classes is dangerous.  IMO the question isn't whether we
see ALL_REGS used but whether there's a reason in principle why it
wouldn't be used.  E.g. ALL_REGS is the starting class for the
best_class calculation, and LRA uses ALL_REGS as the default choice
for scratch reload registers.

It's not like we can claim that the testsuite will flag up if this
goes wrong again, since AIUI there are no tests that show the reason
we need to make this change.  (I realise the patch includes an md
change to keep the testsuite happy, but that's not the same thing.
I mean more a test that shows why removing the '*'s made things
worse, through no fault of its own.)

Conceptually what we're saying here is that if the given classes
include both GENERAL_REGS and FP_REGS, we'll choose between them
based on the mode of the register.  And that makes sense for any
class that includes both GENERAL_REGS and FP_REGS.  We could write
it that way if it seems better, i.e.:

  if (!reg_class_subset_p (GENERAL_REGS, ...)
      || !reg_class_subset_p (FP_REGS, ...))
    ...

That way we don't mention any union classes, and I think the meaning
is clear in the context of eventually returning GENERAL_REGS or FP_REGS.

reg_class_subset_p tests for the normal inclusive subset relation
rather than "strict subset".

Thanks,
Richard

Wilco Dijkstra May 25, 2018, 1:16 p.m. UTC | #4

Richard Sandiford wrote:

> Conceptually what we're saying here is that if the given classes
> include both GENERAL_REGS and FP_REGS, we'll choose between them
> based on the mode of the register.  And that makes sense for any
> class that includes both GENERAL_REGS and FP_REGS.  We could write
> it that way if it seems better, i.e.:
>
>  if (!reg_class_subset_p (GENERAL_REGS, ...)
>      || !reg_class_subset_p (FP_REGS, ...))
>    ...
>
> That way we don't mention any union classes, and I think the meaning
> is clear in the context of eventually returning GENERAL_REGS or FP_REGS.
>
> reg_class_subset_p tests for the normal inclusive subset relation
> rather than "strict subset".

Right, checking for a subset of GENERAL_REGS and FP_REGS does make sense
and is more clear as well. It appears to behave identically, so here is the new version:


A recent commit removing '*' from the md files caused a large regression in h264ref.
It turns out aarch64_ira_change_pseudo_allocno_class is no longer effective after the
SVE changes, and the combination results in the regression.  This patch fixes it by
explicitly checking for a subset of GENERAL_REGS and FP_REGS.
Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.

Passes regress, OK for commit? Since it is a regression introduced in GCC8, OK to
backport to GCC8?

ChangeLog:
2018-05-25  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/aarch64/aarch64.c (aarch64_ira_change_pseudo_allocno_class):
	Check for subset of GENERAL_REGS and FP_REGS.
	* config/aarch64/aarch64-simd.md (aarch64_get_lane): Increase cost of r=w alternative.

--
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 2ebd256329c1a6a6b790d16955cbcee3feca456c..3d5fe44b53198a92afb726712c6e9dee890afe38 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2961,7 +2961,7 @@ (define_insn "*aarch64_get_lane_zero_extendsi<mode>"
 ;; is guaranteed so upper bits should be considered undefined.
 ;; RTL uses GCC vector extension indices throughout so flip only for assembly.
 (define_insn "aarch64_get_lane<mode>"
-  [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=r, w, Utv")
+  [(set (match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "=?r, w, Utv")
 	(vec_select:<VEL>
 	  (match_operand:VALL_F16 1 "register_operand" "w, w, w")
 	  (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 47d98dfd095cdcd15908a86091cf2f8a4d6137b1..6e7722187f0f79195c8b6c43f463a3ac9aa61742 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1059,16 +1059,17 @@ aarch64_err_no_fpadvsimd (machine_mode mode, const char *msg)
 }
 
 /* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.
-   The register allocator chooses ALL_REGS if FP_REGS and GENERAL_REGS have
-   the same cost even if ALL_REGS has a much larger cost.  ALL_REGS is also
-   used if the cost of both FP_REGS and GENERAL_REGS is lower than the memory
-   cost (in this case the best class is the lowest cost one).  Using ALL_REGS
-   irrespectively of its cost results in bad allocations with many redundant
-   int<->FP moves which are expensive on various cores.
-   To avoid this we don't allow ALL_REGS as the allocno class, but force a
-   decision between FP_REGS and GENERAL_REGS.  We use the allocno class if it
-   isn't ALL_REGS.  Similarly, use the best class if it isn't ALL_REGS.
-   Otherwise set the allocno class depending on the mode.
+   The register allocator chooses POINTER_AND_FP_REGS if FP_REGS and
+   GENERAL_REGS have the same cost - even if POINTER_AND_FP_REGS has a much
+   higher cost.  POINTER_AND_FP_REGS is also used if the cost of both FP_REGS
+   and GENERAL_REGS is lower than the memory cost (in this case the best class
+   is the lowest cost one).  Using POINTER_AND_FP_REGS irrespectively of its
+   cost results in bad allocations with many redundant int<->FP moves which
+   are expensive on various cores.
+   To avoid this we don't allow POINTER_AND_FP_REGS as the allocno class, but
+   force a decision between FP_REGS and GENERAL_REGS.  We use the allocno class
+   if it isn't POINTER_AND_FP_REGS.  Similarly, use the best class if it isn't
+   POINTER_AND_FP_REGS.  Otherwise set the allocno class depending on the mode.
    The result of this is that it is no longer inefficient to have a higher
    memory move cost than the register move cost.
 */
@@ -1079,10 +1080,12 @@ aarch64_ira_change_pseudo_allocno_class (int regno, reg_class_t allocno_class,
 {
   machine_mode mode;
 
-  if (allocno_class != ALL_REGS)
+  if (reg_class_subset_p (allocno_class, GENERAL_REGS)
+      || reg_class_subset_p (allocno_class, FP_REGS))
     return allocno_class;
 
-  if (best_class != ALL_REGS)
+  if (reg_class_subset_p (best_class, GENERAL_REGS)
+      || reg_class_subset_p (best_class, FP_REGS))
     return best_class;
 
   mode = PSEUDO_REGNO_MODE (regno);

James Greenhalgh May 29, 2018, 3:58 p.m. UTC | #5

On Fri, May 25, 2018 at 08:16:03AM -0500, Wilco Dijkstra wrote:
> Richard Sandiford wrote:
> 
> > Conceptually what we're saying here is that if the given classes
> > include both GENERAL_REGS and FP_REGS, we'll choose between them
> > based on the mode of the register.  And that makes sense for any
> > class that includes both GENERAL_REGS and FP_REGS.  We could write
> > it that way if it seems better, i.e.:
> >
> >  if (!reg_class_subset_p (GENERAL_REGS, ...)
> >      || !reg_class_subset_p (FP_REGS, ...))
> >    ...
> >
> > That way we don't mention any union classes, and I think the meaning
> > is clear in the context of eventually returning GENERAL_REGS or FP_REGS.
> >
> > reg_class_subset_p tests for the normal inclusive subset relation
> > rather than "strict subset".
> 
> Right, checking for a subset of GENERAL_REGS and FP_REGS does make sense
> and is more clear as well. It appears to behave identically, so here is the new version:
> 
> 
> A recent commit removing '*' from the md files caused a large regression in h264ref.
> It turns out aarch64_ira_change_pseudo_allocno_class is no longer effective after the
> SVE changes, and the combination results in the regression.  This patch fixes it by
> explicitly checking for a subset of GENERAL_REGS and FP_REGS.

OK for trunk.

> Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.

I'd prefer more detail than this for a workaround; which test, why did it
start to fail, why is this the right solution, etc.

Thanks,
James

> ChangeLog:
> 2018-05-25  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* config/aarch64/aarch64.c (aarch64_ira_change_pseudo_allocno_class):
> 	Check for subset of GENERAL_REGS and FP_REGS.
> 	* config/aarch64/aarch64-simd.md (aarch64_get_lane): Increase cost of r=w alternative.

Wilco Dijkstra May 29, 2018, 5:34 p.m. UTC | #6

James Greenhalgh wrote:

> > Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.
>
> > I'd prefer more detail than this for a workaround; which test, why did it
> > start to fail, why is this the right solution, etc.

It was gcc.target/aarch64/vect_copy_lane_1.c generating:

test_copy_laneq_f64:
        umov    x0, v1.d[1]
        fmov    d0, x0
        ret

For some reason returning a double uses DImode temporaries, so it's essential
to prefer FP_REGS here and mark the lane copy correctly.

Wilco

Richard Sandiford May 29, 2018, 8:21 p.m. UTC | #7

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> James Greenhalgh wrote:
>
>> > Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.
>>
>> > I'd prefer more detail than this for a workaround; which test, why did it
>> > start to fail, why is this the right solution, etc.
>
> It was gcc.target/aarch64/vect_copy_lane_1.c generating:
>
> test_copy_laneq_f64:
>         umov    x0, v1.d[1]
>         fmov    d0, x0
>         ret
>
> For some reason returning a double uses DImode temporaries, so it's essential
> to prefer FP_REGS here and mark the lane copy correctly.

The "?" change seems to make intrinsic sense given the extra cost of the
GPR alternative.  But I think the real reason for this failure is that
we define no V1DF patterns, and target-independent code falls back to
using moves in the corresponding *integer* mode.  So for that function
we generate the rather ugly code:

(note 6 1 3 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 3 6 2 2 (clobber (reg/v:V1DF 92 [ aD.21157 ])) "vect_copy_lane_1.c":45 -1
     (nil))
(insn 2 3 4 2 (set (subreg:DI (reg/v:V1DF 92 [ aD.21157 ]) 0)
        (reg:DI 32 v0 [ aD.21157 ])) "vect_copy_lane_1.c":45 47 {*movdi_aarch64}
     (nil))
(insn 4 2 5 2 (set (reg/v:V2DF 93 [ bD.21158 ])
        (reg:V2DF 33 v1 [ bD.21158 ])) "vect_copy_lane_1.c":45 1063 {*aarch64_simd_movv2df}
     (nil))
(note 5 4 8 2 NOTE_INSN_FUNCTION_BEG)
(insn 8 5 9 2 (set (reg:DF 95)
        (vec_select:DF (reg/v:V2DF 93 [ bD.21158 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) "./include/arm_neon.h":14441 1993 {aarch64_get_lanev2df}
     (nil))
(insn 9 8 11 2 (set (reg:DI 96)
        (subreg:DI (reg:DF 95) 0)) "vect_copy_lane_1.c":45 47 {*movdi_aarch64}
     (nil))
(insn 11 9 10 2 (clobber (reg:V1DF 91 [ <retval> ])) "vect_copy_lane_1.c":45 -1
     (nil))
(insn 10 11 15 2 (set (subreg:DI (reg:V1DF 91 [ <retval> ]) 0)
        (reg:DI 96)) "vect_copy_lane_1.c":45 47 {*movdi_aarch64}
     (nil))
(insn 15 10 16 2 (set (reg:DI 32 v0)
        (subreg:DI (reg:V1DF 91 [ <retval> ]) 0)) "vect_copy_lane_1.c":45 47 {*movdi_aarch64}
     (nil))
(insn 16 15 0 2 (use (reg/i:V1DF 32 v0)) "vect_copy_lane_1.c":45 -1
     (nil))

which by IRA gets optimised to:

(insn 9 8 15 2 (set (subreg:DF (reg:DI 96) 0)
        (vec_select:DF (reg:V2DF 33 v1 [ bD.21158 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) "vect_copy_lane_1.c":45 1993 {aarch64_get_lanev2df}
     (expr_list:REG_DEAD (reg:V2DF 33 v1 [ bD.21158 ])
        (nil)))
(insn 15 9 16 2 (set (reg:DI 32 v0)
        (reg:DI 96)) "vect_copy_lane_1.c":45 47 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 96)
        (nil)))
(insn 16 15 18 2 (use (reg/i:V1DF 32 v0)) "vect_copy_lane_1.c":45 -1
     (nil))

with the move now being done purely in DImode.  This defeats the
heuristic in aarch64_ira_change_pseudo_allocno_class because the
pseudo appears to be a normal integer rather than a (float) vector.

Although the "?" fixes this particular instance, I think more
complicated V1DF code would still regress by being forced to
use GENERAL_REGS.  Of course, the fix is to add the move pattern
rather than disable the heuristic...

Thanks,
Richard

Wilco Dijkstra May 30, 2018, 10:26 a.m. UTC | #8

Richard Sandiford <richard.sandiford@linaro.org>

> The "?" change seems to make intrinsic sense given the extra cost of the
> GPR alternative.  But I think the real reason for this failure is that
> we define no V1DF patterns, and target-independent code falls back to
> using moves in the corresponding *integer* mode.  So for that function
> we generate the rather ugly code:

This:

typedef struct { double x; } X;
X f2(X *p)
{
  return *p;
}

emits at expand:

(insn 6 3 7 2 (set (reg:DF 90 [ D.21009 ])
        (mem:DF (reg/v/f:DI 92 [ p ]) [2 *p_2(D)+0 S8 A64])) "vect_copy_lane_1.c":26 -1
     (nil))
(insn 7 6 8 2 (set (subreg:DF (reg:DI 94) 0)
        (reg:DF 90 [ D.21009 ])) "vect_copy_lane_1.c":26 -1
     (nil))
(insn 8 7 9 2 (set (reg:DI 95)
        (reg:DI 94)) "vect_copy_lane_1.c":26 -1
     (nil))
(insn 9 8 13 2 (set (reg:DF 91 [ <retval> ])
        (subreg:DF (reg:DI 95) 0)) "vect_copy_lane_1.c":26 -1
     (nil))

So the underlying cause is the structure passing code. Things get worse when you return
2 doubles and it really becomes horrific at 3...

Wilco

Richard Sandiford May 30, 2018, 6:47 p.m. UTC | #9

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Richard Sandiford <richard.sandiford@linaro.org>
>> The "?" change seems to make intrinsic sense given the extra cost of the
>> GPR alternative.  But I think the real reason for this failure is that
>> we define no V1DF patterns, and target-independent code falls back to
>> using moves in the corresponding *integer* mode.  So for that function
>> we generate the rather ugly code:
>
> This:
>
> typedef struct { double x; } X;
> X f2(X *p)
> {
>   return *p;
> }
>
> emits at expand:
>
> (insn 6 3 7 2 (set (reg:DF 90 [ D.21009 ])
>         (mem:DF (reg/v/f:DI 92 [ p ]) [2 *p_2(D)+0 S8 A64])) "vect_copy_lane_1.c":26 -1
>      (nil))
> (insn 7 6 8 2 (set (subreg:DF (reg:DI 94) 0)
>         (reg:DF 90 [ D.21009 ])) "vect_copy_lane_1.c":26 -1
>      (nil))
> (insn 8 7 9 2 (set (reg:DI 95)
>         (reg:DI 94)) "vect_copy_lane_1.c":26 -1
>      (nil))
> (insn 9 8 13 2 (set (reg:DF 91 [ <retval> ])
>         (subreg:DF (reg:DI 95) 0)) "vect_copy_lane_1.c":26 -1
>      (nil))
>
> So the underlying cause is the structure passing code. Things get
> worse when you return 2 doubles and it really becomes horrific at 3...

Yeah, the handling of structures can also be poor, but float64x1_t is a
vector type rather than a structure, so I don't think the above is the
problem in the specific case of test_copy_laneq_f64.

float64x1_t has the TYPE_MODE we want (V1DF).  But because we have
no V1DF move pattern, it ends up being moved as a DI instead.

Thanks,
Richard

Christophe Lyon May 31, 2018, 8:35 a.m. UTC | #10

On 29 May 2018 at 19:34, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> James Greenhalgh wrote:
>
>> > Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.
>>
>> > I'd prefer more detail than this for a workaround; which test, why did it
>> > start to fail, why is this the right solution, etc.
>
> It was gcc.target/aarch64/vect_copy_lane_1.c generating:
>
> test_copy_laneq_f64:
>         umov    x0, v1.d[1]
>         fmov    d0, x0
>         ret
>
> For some reason returning a double uses DImode temporaries, so it's essential
> to prefer FP_REGS here and mark the lane copy correctly.
>
> Wilco
>

Hi Wilco,

This has probably been reported elsewhere already but I can't find
such a report, so sorry for possible duplicate,
but this patch is causing ICEs on aarch64
FAIL:    gcc.target/aarch64/sve/reduc_1.c -march=armv8.2-a+sve
(internal compiler error)
FAIL:    gcc.target/aarch64/sve/reduc_5.c -march=armv8.2-a+sve
(internal compiler error)

and also many scan-assembler regressions:

http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/260951/report-build-info.html

Can you check?

Thanks

Christophe

Richard Sandiford May 31, 2018, 9:27 a.m. UTC | #11

Christophe Lyon <christophe.lyon@linaro.org> writes:
> On 29 May 2018 at 19:34, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> James Greenhalgh wrote:
>>
>>> > Add a missing ? to aarch64_get_lane to fix a failure in the testsuite.
>>>
>>> > I'd prefer more detail than this for a workaround; which test, why did it
>>> > start to fail, why is this the right solution, etc.
>>
>> It was gcc.target/aarch64/vect_copy_lane_1.c generating:
>>
>> test_copy_laneq_f64:
>>         umov    x0, v1.d[1]
>>         fmov    d0, x0
>>         ret
>>
>> For some reason returning a double uses DImode temporaries, so it's essential
>> to prefer FP_REGS here and mark the lane copy correctly.
>>
>> Wilco
>>
>
> Hi Wilco,
>
> This has probably been reported elsewhere already but I can't find
> such a report, so sorry for possible duplicate,
> but this patch is causing ICEs on aarch64
> FAIL:    gcc.target/aarch64/sve/reduc_1.c -march=armv8.2-a+sve
> (internal compiler error)
> FAIL:    gcc.target/aarch64/sve/reduc_5.c -march=armv8.2-a+sve
> (internal compiler error)
>
> and also many scan-assembler regressions:
>
> http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/260951/report-build-info.html

Thanks for the heads-up.  Looks like they're all SVE, so I'll take this.

Richard

Wilco Dijkstra May 31, 2018, 11:16 a.m. UTC | #12

Richard Sandiford wrote:

>> This has probably been reported elsewhere already but I can't find
>> such a report, so sorry for possible duplicate,
>> but this patch is causing ICEs on aarch64
>> FAIL:    gcc.target/aarch64/sve/reduc_1.c -march=armv8.2-a+sve
>> (internal compiler error)
>> FAIL:    gcc.target/aarch64/sve/reduc_5.c -march=armv8.2-a+sve
>> (internal compiler error)
>>
>> and also many scan-assembler regressions:
>>
>>  http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/260951/report-build-info.html
>
> Thanks for the heads-up.  Looks like they're all SVE, so I'll take this.

It seems this is due to unnecessary spills of PR_REGS - the subset doesn't work for those.
The original proposal doing:

  if (allocno_class != POINTER_AND_FP_REGS)
    return allocno_class;

doesn't appear to affect SVE. However the question is whether the register allocator
can get confused about PR_REGS and end up with POINTER_AND_FP_REGS for
both the allocno_class and best_class? If so then the return needs to support predicate
modes too.

Wilco

Richard Sandiford May 31, 2018, 12:23 p.m. UTC | #13

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Richard Sandiford wrote:
>
>>> This has probably been reported elsewhere already but I can't find
>>> such a report, so sorry for possible duplicate,
>>> but this patch is causing ICEs on aarch64
>>> FAIL:    gcc.target/aarch64/sve/reduc_1.c -march=armv8.2-a+sve
>>> (internal compiler error)
>>> FAIL:    gcc.target/aarch64/sve/reduc_5.c -march=armv8.2-a+sve
>>> (internal compiler error)
>>>
>>> and also many scan-assembler regressions:
>>>
>>>  http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/260951/report-build-info.html
>>
>> Thanks for the heads-up.  Looks like they're all SVE, so I'll take this.
>
> It seems this is due to unnecessary spills of PR_REGS - the subset doesn't work for those.

It does, but I'd originally suggested:

  if (!reg_class_subset_p (GENERAL_REGS, ...)
      || !reg_class_subset_p (FP_REGS, ...))
    ...bail out...

whereas the committed patch had:

  if (reg_class_subset_p (..., GENERAL_REGS)
      || reg_class_subset_p (..., FP_REGS))
    ...bail out...

That's an important difference.  The idea with the first was that
we should only make a choice between GENERAL_REGS and FP_REGS
if the original classes included both of them.  And that's what
we want because the new class has to be a refinement of the
original: it shouldn't include entirely new registers.

The committed version instead says that we won't make a choice
between GENERAL_REGS and FP_REGS if one of the classes is already
specific to one of them.  I think this would also lead to us changing
POINTER_REGS to GENERAL_REGS, although I don't know how much that
matters in practice.

> The original proposal doing:
>
>   if (allocno_class != POINTER_AND_FP_REGS)
>     return allocno_class;
>
> doesn't appear to affect SVE. However the question is whether the
> register allocator can get confused about PR_REGS and end up with
> POINTER_AND_FP_REGS for both the allocno_class and best_class? If so
> then the return needs to support predicate modes too.

Yeah, that shouldn't happen, since predicate modes are only allowed in
predicate registers.

I think the reduc_1 ICE is a separate bug that I'll post a patch for,
but it goes latent again after the patch below.

Tested on aarch64-linux-gnu.  I don't think it can be called obvious
given the above, and it's only SVE-specifc by chance, so: OK to install?

Thanks,
Richard


2018-05-31  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* config/aarch64/aarch64.c (aarch64_ira_change_pseudo_allocno_class):
	Fix subreg tests so that we only return a choice between
	GENERAL_REGS and FP_REGS if the original classes included both.

Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2018-05-30 19:31:14.212387813 +0100
+++ gcc/config/aarch64/aarch64.c	2018-05-31 13:12:56.836974021 +0100
@@ -1108,12 +1108,12 @@ aarch64_ira_change_pseudo_allocno_class
 {
   machine_mode mode;
 
-  if (reg_class_subset_p (allocno_class, GENERAL_REGS)
-      || reg_class_subset_p (allocno_class, FP_REGS))
+  if (!reg_class_subset_p (GENERAL_REGS, allocno_class)
+      || !reg_class_subset_p (FP_REGS, allocno_class))
     return allocno_class;
 
-  if (reg_class_subset_p (best_class, GENERAL_REGS)
-      || reg_class_subset_p (best_class, FP_REGS))
+  if (!reg_class_subset_p (GENERAL_REGS, best_class)
+      || !reg_class_subset_p (FP_REGS, best_class))
     return best_class;
 
   mode = PSEUDO_REGNO_MODE (regno);

James Greenhalgh May 31, 2018, 7:26 p.m. UTC | #14

On Thu, May 31, 2018 at 07:23:29AM -0500, Richard Sandiford wrote:
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > Richard Sandiford wrote:
> >
> >>> This has probably been reported elsewhere already but I can't find
> >>> such a report, so sorry for possible duplicate,
> >>> but this patch is causing ICEs on aarch64
> >>> FAIL:    gcc.target/aarch64/sve/reduc_1.c -march=armv8.2-a+sve
> >>> (internal compiler error)
> >>> FAIL:    gcc.target/aarch64/sve/reduc_5.c -march=armv8.2-a+sve
> >>> (internal compiler error)
> >>>
> >>> and also many scan-assembler regressions:
> >>>
> >>>  http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/260951/report-build-info.html
> >>
> >> Thanks for the heads-up.  Looks like they're all SVE, so I'll take this.
> >
> > It seems this is due to unnecessary spills of PR_REGS - the subset doesn't work for those.
> 
> It does, but I'd originally suggested:
> 
>   if (!reg_class_subset_p (GENERAL_REGS, ...)
>       || !reg_class_subset_p (FP_REGS, ...))
>     ...bail out...
> 
> whereas the committed patch had:
> 
>   if (reg_class_subset_p (..., GENERAL_REGS)
>       || reg_class_subset_p (..., FP_REGS))
>     ...bail out...
> 
> That's an important difference.  The idea with the first was that
> we should only make a choice between GENERAL_REGS and FP_REGS
> if the original classes included both of them.  And that's what
> we want because the new class has to be a refinement of the
> original: it shouldn't include entirely new registers.
> 
> The committed version instead says that we won't make a choice
> between GENERAL_REGS and FP_REGS if one of the classes is already
> specific to one of them.  I think this would also lead to us changing
> POINTER_REGS to GENERAL_REGS, although I don't know how much that
> matters in practice.

Sorry to have missed this detail in review.

> > The original proposal doing:
> >
> >   if (allocno_class != POINTER_AND_FP_REGS)
> >     return allocno_class;
> >
> > doesn't appear to affect SVE. However the question is whether the
> > register allocator can get confused about PR_REGS and end up with
> > POINTER_AND_FP_REGS for both the allocno_class and best_class? If so
> > then the return needs to support predicate modes too.
> 
> Yeah, that shouldn't happen, since predicate modes are only allowed in
> predicate registers.
> 
> I think the reduc_1 ICE is a separate bug that I'll post a patch for,
> but it goes latent again after the patch below.
> 
> Tested on aarch64-linux-gnu.  I don't think it can be called obvious
> given the above, and it's only SVE-specifc by chance, so: OK to install?

This is OK for trunk.

Thanks,
James

> 2018-05-31  Richard Sandiford  <richard.sandiford@linaro.org>
> 
> gcc/
> 	* config/aarch64/aarch64.c (aarch64_ira_change_pseudo_allocno_class):
> 	Fix subreg tests so that we only return a choice between
> 	GENERAL_REGS and FP_REGS if the original classes included both.

[AArch64] Fix aarch64_ira_change_pseudo_allocno_class

Commit Message

Comments

Patch