diff mbox series

[2/3] arm: Remove unnecessary zero-extending of MVE predicates before use [PR 107674]

Message ID 22ba05fb-774e-62b8-64a2-90c5d73fcaba@arm.com
State New
Headers show
Series arm: Fix regressions around MVE predicate codegen | expand

Commit Message

Andre Vieira (lists) Jan. 24, 2023, 1:54 p.m. UTC
Hi,

This patch teaches GCC that zero-extending a MVE predicate from 16-bits 
to 32-bits and then only using 16-bits is a no-op.
It does so in two steps:
- it lets gcc know that it can access any MVE predicate mode using any 
other MVE predicate mode without needing to copy it, using the 
TARGET_MODES_TIEABLE_P hook,
- it teaches simplify_subreg to optimize a subreg with a vector 
outermode, by replacing this outermode with a same-sized integer mode 
and trying the avalailable optimizations, then if successful it 
surrounds the result with a subreg casting it back to the original 
vector outermode.

This removes the unnecessary zero-extending shown on PR 107674 (though 
it's a sign-extend there), that was introduced in gcc 11.

Bootstrapped on aarch64-none-linux-gnu and regression tested on 
arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.

OK for trunk?

gcc/ChangeLog:

	PR target/107674
         * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
         (arm_modes_tieable_p): Make MVE predicate modes tieable.
	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
	simplify_subreg to simplify subregs where the outermode is not scalar.

gcc/testsuite/ChangeLog:

	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
	zero-extend.

Comments

Kyrylo Tkachov Jan. 26, 2023, 3:06 p.m. UTC | #1
Hi Andre,

> -----Original Message-----
> From: Andre Vieira (lists) <andre.simoesdiasvieira@arm.com>
> Sent: Tuesday, January 24, 2023 1:54 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Richard Biener <rguenther@suse.de>;
> Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: [PATCH 2/3] arm: Remove unnecessary zero-extending of MVE
> predicates before use [PR 107674]
> 
> Hi,
> 
> This patch teaches GCC that zero-extending a MVE predicate from 16-bits
> to 32-bits and then only using 16-bits is a no-op.
> It does so in two steps:
> - it lets gcc know that it can access any MVE predicate mode using any
> other MVE predicate mode without needing to copy it, using the
> TARGET_MODES_TIEABLE_P hook,
> - it teaches simplify_subreg to optimize a subreg with a vector
> outermode, by replacing this outermode with a same-sized integer mode
> and trying the avalailable optimizations, then if successful it
> surrounds the result with a subreg casting it back to the original
> vector outermode.
> 
> This removes the unnecessary zero-extending shown on PR 107674 (though
> it's a sign-extend there), that was introduced in gcc 11.
> 
> Bootstrapped on aarch64-none-linux-gnu and regression tested on
> arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.
> 
> OK for trunk?
> 
> gcc/ChangeLog:
> 
> 	PR target/107674
>          * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
>          (arm_modes_tieable_p): Make MVE predicate modes tieable.
> 	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
> 	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
> 	simplify_subreg to simplify subregs where the outermode is not
> scalar.

The arm changes look ok to me. We'll want a midend maintainer to have a look at simplify-rtx.cc

> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
> 	zero-extend.

diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
index 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5acf9af0a2155b5c5 100644
--- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
+++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
@@ -16,9 +16,6 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
 **	vldrb.8	q2, \[r0\]
 **	vldrb.8	q1, \[r1\]
 **	vcmp.i8	eq, q2, q1
-**	vmrs	r3, p0	@ movhi
-**	uxth	r3, r3
-**	vmsr	p0, r3	@ movhi
 **	vpst
 **	vaddt.i8	q3, q2, q1
 **	vpst

Ah I see, that's the testcase from patch 1/3 that I criticized :)
Maybe if we just scan for absence of an uxth, vmrs and vmsr it will be more robust?
Thanks,
Kyrill
Andre Vieira (lists) Jan. 27, 2023, 9:58 a.m. UTC | #2
On 26/01/2023 15:06, Kyrylo Tkachov wrote:
> Hi Andre,
> 
>> -----Original Message-----
>> From: Andre Vieira (lists) <andre.simoesdiasvieira@arm.com>
>> Sent: Tuesday, January 24, 2023 1:54 PM
>> To: gcc-patches@gcc.gnu.org
>> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Richard Biener <rguenther@suse.de>;
>> Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>> Subject: [PATCH 2/3] arm: Remove unnecessary zero-extending of MVE
>> predicates before use [PR 107674]
>>
>> Hi,
>>
>> This patch teaches GCC that zero-extending a MVE predicate from 16-bits
>> to 32-bits and then only using 16-bits is a no-op.
>> It does so in two steps:
>> - it lets gcc know that it can access any MVE predicate mode using any
>> other MVE predicate mode without needing to copy it, using the
>> TARGET_MODES_TIEABLE_P hook,
>> - it teaches simplify_subreg to optimize a subreg with a vector
>> outermode, by replacing this outermode with a same-sized integer mode
>> and trying the avalailable optimizations, then if successful it
>> surrounds the result with a subreg casting it back to the original
>> vector outermode.
>>
>> This removes the unnecessary zero-extending shown on PR 107674 (though
>> it's a sign-extend there), that was introduced in gcc 11.
>>
>> Bootstrapped on aarch64-none-linux-gnu and regression tested on
>> arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.
>>
>> OK for trunk?
>>
>> gcc/ChangeLog:
>>
>> 	PR target/107674
>>           * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
>>           (arm_modes_tieable_p): Make MVE predicate modes tieable.
>> 	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
>> 	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
>> 	simplify_subreg to simplify subregs where the outermode is not
>> scalar.
> 
> The arm changes look ok to me. We'll want a midend maintainer to have a look at simplify-rtx.cc
> 
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
>> 	zero-extend.
> 
> diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> index 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5acf9af0a2155b5c5 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> @@ -16,9 +16,6 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
>   **	vldrb.8	q2, \[r0\]
>   **	vldrb.8	q1, \[r1\]
>   **	vcmp.i8	eq, q2, q1
> -**	vmrs	r3, p0	@ movhi
> -**	uxth	r3, r3
> -**	vmsr	p0, r3	@ movhi
>   **	vpst
>   **	vaddt.i8	q3, q2, q1
>   **	vpst
> 
> Ah I see, that's the testcase from patch 1/3 that I criticized :)
> Maybe if we just scan for absence of an uxth, vmrs and vmsr it will be more robust?
> Thanks,
> Kyrill
I could, but I would rather not. I have a patch series waiting for GCC 
14 that does further improvements to this (and other VPST codegen) 
sequences and if I do scan for 'absence' of an instruction I have to 
break them up into single tests each. Also it wouldn't then fail if we 
start spilling the predicate directly to memory for instance. Like I 
mentioned in the previous patch, the sequence is unlikely to be able to 
change through scheduling (other than maybe the reordering of the loads 
through some bad luck, but I could make it robust to that).
Kyrylo Tkachov Jan. 27, 2023, 9:59 a.m. UTC | #3
> -----Original Message-----
> From: Andre Vieira (lists) <andre.simoesdiasvieira@arm.com>
> Sent: Friday, January 27, 2023 9:58 AM
> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; gcc-patches@gcc.gnu.org
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Richard Biener <rguenther@suse.de>
> Subject: Re: [PATCH 2/3] arm: Remove unnecessary zero-extending of MVE
> predicates before use [PR 107674]
> 
> 
> 
> On 26/01/2023 15:06, Kyrylo Tkachov wrote:
> > Hi Andre,
> >
> >> -----Original Message-----
> >> From: Andre Vieira (lists) <andre.simoesdiasvieira@arm.com>
> >> Sent: Tuesday, January 24, 2023 1:54 PM
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
> >> <Richard.Earnshaw@arm.com>; Richard Biener <rguenther@suse.de>;
> >> Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> >> Subject: [PATCH 2/3] arm: Remove unnecessary zero-extending of MVE
> >> predicates before use [PR 107674]
> >>
> >> Hi,
> >>
> >> This patch teaches GCC that zero-extending a MVE predicate from 16-bits
> >> to 32-bits and then only using 16-bits is a no-op.
> >> It does so in two steps:
> >> - it lets gcc know that it can access any MVE predicate mode using any
> >> other MVE predicate mode without needing to copy it, using the
> >> TARGET_MODES_TIEABLE_P hook,
> >> - it teaches simplify_subreg to optimize a subreg with a vector
> >> outermode, by replacing this outermode with a same-sized integer mode
> >> and trying the avalailable optimizations, then if successful it
> >> surrounds the result with a subreg casting it back to the original
> >> vector outermode.
> >>
> >> This removes the unnecessary zero-extending shown on PR 107674
> (though
> >> it's a sign-extend there), that was introduced in gcc 11.
> >>
> >> Bootstrapped on aarch64-none-linux-gnu and regression tested on
> >> arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.
> >>
> >> OK for trunk?
> >>
> >> gcc/ChangeLog:
> >>
> >> 	PR target/107674
> >>           * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
> >>           (arm_modes_tieable_p): Make MVE predicate modes tieable.
> >> 	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
> >> 	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
> >> 	simplify_subreg to simplify subregs where the outermode is not
> >> scalar.
> >
> > The arm changes look ok to me. We'll want a midend maintainer to have a
> look at simplify-rtx.cc
> >
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> 	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
> >> 	zero-extend.
> >
> > diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> > index
> 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5
> acf9af0a2155b5c5 100644
> > --- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> > +++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> > @@ -16,9 +16,6 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
> >   **	vldrb.8	q2, \[r0\]
> >   **	vldrb.8	q1, \[r1\]
> >   **	vcmp.i8	eq, q2, q1
> > -**	vmrs	r3, p0	@ movhi
> > -**	uxth	r3, r3
> > -**	vmsr	p0, r3	@ movhi
> >   **	vpst
> >   **	vaddt.i8	q3, q2, q1
> >   **	vpst
> >
> > Ah I see, that's the testcase from patch 1/3 that I criticized :)
> > Maybe if we just scan for absence of an uxth, vmrs and vmsr it will be more
> robust?
> > Thanks,
> > Kyrill
> I could, but I would rather not. I have a patch series waiting for GCC
> 14 that does further improvements to this (and other VPST codegen)
> sequences and if I do scan for 'absence' of an instruction I have to
> break them up into single tests each. Also it wouldn't then fail if we
> start spilling the predicate directly to memory for instance. Like I
> mentioned in the previous patch, the sequence is unlikely to be able to
> change through scheduling (other than maybe the reordering of the loads
> through some bad luck, but I could make it robust to that).

Ok, looks like it was thought through, so fine by me.
Thanks,
Kyrill
Andre Vieira (lists) Jan. 30, 2023, 4:41 p.m. UTC | #4
Changed the testcase to be more robust (as per the discussion for the 
first patch).

Still need the OK for the mid-end (simplify-rtx) part.

Kind regards,
Andre

On 27/01/2023 09:59, Kyrylo Tkachov wrote:
> 
> 
>> -----Original Message-----
>> From: Andre Vieira (lists) <andre.simoesdiasvieira@arm.com>
>> Sent: Friday, January 27, 2023 9:58 AM
>> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; gcc-patches@gcc.gnu.org
>> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Richard Biener <rguenther@suse.de>
>> Subject: Re: [PATCH 2/3] arm: Remove unnecessary zero-extending of MVE
>> predicates before use [PR 107674]
>>
>>
>>
>> On 26/01/2023 15:06, Kyrylo Tkachov wrote:
>>> Hi Andre,
>>>
>>>> -----Original Message-----
>>>> From: Andre Vieira (lists) <andre.simoesdiasvieira@arm.com>
>>>> Sent: Tuesday, January 24, 2023 1:54 PM
>>>> To: gcc-patches@gcc.gnu.org
>>>> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Earnshaw
>>>> <Richard.Earnshaw@arm.com>; Richard Biener <rguenther@suse.de>;
>>>> Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>>>> Subject: [PATCH 2/3] arm: Remove unnecessary zero-extending of MVE
>>>> predicates before use [PR 107674]
>>>>
>>>> Hi,
>>>>
>>>> This patch teaches GCC that zero-extending a MVE predicate from 16-bits
>>>> to 32-bits and then only using 16-bits is a no-op.
>>>> It does so in two steps:
>>>> - it lets gcc know that it can access any MVE predicate mode using any
>>>> other MVE predicate mode without needing to copy it, using the
>>>> TARGET_MODES_TIEABLE_P hook,
>>>> - it teaches simplify_subreg to optimize a subreg with a vector
>>>> outermode, by replacing this outermode with a same-sized integer mode
>>>> and trying the avalailable optimizations, then if successful it
>>>> surrounds the result with a subreg casting it back to the original
>>>> vector outermode.
>>>>
>>>> This removes the unnecessary zero-extending shown on PR 107674
>> (though
>>>> it's a sign-extend there), that was introduced in gcc 11.
>>>>
>>>> Bootstrapped on aarch64-none-linux-gnu and regression tested on
>>>> arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.
>>>>
>>>> OK for trunk?
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> 	PR target/107674
>>>>            * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
>>>>            (arm_modes_tieable_p): Make MVE predicate modes tieable.
>>>> 	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
>>>> 	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
>>>> 	simplify_subreg to simplify subregs where the outermode is not
>>>> scalar.
>>>
>>> The arm changes look ok to me. We'll want a midend maintainer to have a
>> look at simplify-rtx.cc
>>>
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> 	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
>>>> 	zero-extend.
>>>
>>> diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>> b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>>> index
>> 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5
>> acf9af0a2155b5c5 100644
>>> --- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>>> +++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>>> @@ -16,9 +16,6 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
>>>    **	vldrb.8	q2, \[r0\]
>>>    **	vldrb.8	q1, \[r1\]
>>>    **	vcmp.i8	eq, q2, q1
>>> -**	vmrs	r3, p0	@ movhi
>>> -**	uxth	r3, r3
>>> -**	vmsr	p0, r3	@ movhi
>>>    **	vpst
>>>    **	vaddt.i8	q3, q2, q1
>>>    **	vpst
>>>
>>> Ah I see, that's the testcase from patch 1/3 that I criticized :)
>>> Maybe if we just scan for absence of an uxth, vmrs and vmsr it will be more
>> robust?
>>> Thanks,
>>> Kyrill
>> I could, but I would rather not. I have a patch series waiting for GCC
>> 14 that does further improvements to this (and other VPST codegen)
>> sequences and if I do scan for 'absence' of an instruction I have to
>> break them up into single tests each. Also it wouldn't then fail if we
>> start spilling the predicate directly to memory for instance. Like I
>> mentioned in the previous patch, the sequence is unlikely to be able to
>> change through scheduling (other than maybe the reordering of the loads
>> through some bad luck, but I could make it robust to that).
> 
> Ok, looks like it was thought through, so fine by me.
> Thanks,
> Kyrill
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 632728371d5cef364e47bf33bfa0faba738db871..8325e7a876e2e03f14cba07385cc5a1ddd771655 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1104,6 +1104,10 @@ extern const int arm_arch_cde_coproc_bits[];
    || (MODE) == V16QImode || (MODE) == V8HFmode || (MODE) == V4SFmode \
    || (MODE) == V2DFmode)
 
+#define VALID_MVE_PRED_MODE(MODE) \
+  ((MODE) == HImode							\
+   || (MODE) == V16BImode || (MODE) == V8BImode || (MODE) == V4BImode)
+
 #define VALID_MVE_SI_MODE(MODE) \
   ((MODE) == V2DImode ||(MODE) == V4SImode || (MODE) == V8HImode \
    || (MODE) == V16QImode)
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index efc48349dd3508e6790c1a9f3bba5da689a986bc..4d9d202cad1f39ba386df9d8e4277007fd960262 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -25656,10 +25656,7 @@ arm_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
     return false;
 
   if (IS_VPR_REGNUM (regno))
-    return mode == HImode
-      || mode == V16BImode
-      || mode == V8BImode
-      || mode == V4BImode;
+    return VALID_MVE_PRED_MODE (mode);
 
   if (TARGET_THUMB1)
     /* For the Thumb we only allow values bigger than SImode in
@@ -25738,6 +25735,10 @@ arm_modes_tieable_p (machine_mode mode1, machine_mode mode2)
   if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2))
     return true;
 
+  if (TARGET_HAVE_MVE
+      && (VALID_MVE_PRED_MODE (mode1) && VALID_MVE_PRED_MODE (mode2)))
+    return true;
+
   /* We specifically want to allow elements of "structure" modes to
      be tieable to the structure.  This more general condition allows
      other rarer situations too.  */
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 7fb1e97fbea4e7b8b091f5724ebe0cb61eee7ec3..a951272186585c0a5cc3e0155285e7a635865f42 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7652,6 +7652,22 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op,
 	}
     }
 
+  /* Try simplifying a SUBREG expression of a non-integer OUTERMODE by using a
+     NEW_OUTERMODE of the same size instead, other simplifications rely on
+     integer to integer subregs and we'd potentially miss out on optimizations
+     otherwise.  */
+  if (known_gt (GET_MODE_SIZE (innermode),
+		GET_MODE_SIZE (outermode))
+      && SCALAR_INT_MODE_P (innermode)
+      && !SCALAR_INT_MODE_P (outermode)
+      && int_mode_for_size (GET_MODE_BITSIZE (outermode),
+			    0).exists (&int_outermode))
+    {
+      rtx tem = simplify_subreg (int_outermode, op, innermode, byte);
+      if (tem)
+	return simplify_gen_subreg (outermode, tem, GET_MODE (tem), byte);
+    }
+
   /* If OP is a vector comparison and the subreg is not changing the
      number of elements or the size of the elements, change the result
      of the comparison to the new mode.  */
diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
index 28e4697c3c5bcc89b37fcb296f4b46c861aed27d..41f4e3805d62d0343c4035a328250fb8c7b0c47f 100644
--- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
+++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
@@ -16,12 +16,9 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
 **	vldrb.8	q[0-9]+, \[r[0-9]+\]
 **	vldrb.8	q[0-9]+, \[r[0-9]+\]
 **	vcmp.i8	eq, q[0-9]+, q[0-9]+
-**	vmrs	(r[0-9]+), p0	@ movhi
-**	uxth	\1, \1
-**	vmsr	p0, \1	@ movhi
 **	vpst
 **	vaddt.i8	(q[0-9]+), q[0-9]+, q[0-9]+
 **	vpst
-**	vstrbt.8	\2, \[r[0-9]+\]
+**	vstrbt.8	\1, \[r[0-9]+\]
 **	bx	lr
 */
Richard Sandiford Jan. 30, 2023, 11:17 p.m. UTC | #5
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> This patch teaches GCC that zero-extending a MVE predicate from 16-bits 
> to 32-bits and then only using 16-bits is a no-op.
> It does so in two steps:
> - it lets gcc know that it can access any MVE predicate mode using any 
> other MVE predicate mode without needing to copy it, using the 
> TARGET_MODES_TIEABLE_P hook,
> - it teaches simplify_subreg to optimize a subreg with a vector 
> outermode, by replacing this outermode with a same-sized integer mode 
> and trying the avalailable optimizations, then if successful it 
> surrounds the result with a subreg casting it back to the original 
> vector outermode.
>
> This removes the unnecessary zero-extending shown on PR 107674 (though 
> it's a sign-extend there), that was introduced in gcc 11.
>
> Bootstrapped on aarch64-none-linux-gnu and regression tested on 
> arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.
>
> OK for trunk?
>
> gcc/ChangeLog:
>
> 	PR target/107674
>          * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
>          (arm_modes_tieable_p): Make MVE predicate modes tieable.
> 	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
> 	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
> 	simplify_subreg to simplify subregs where the outermode is not scalar.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
> 	zero-extend.
>
> diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
> index 6f7ecf9128047647fc41677e634cd9612a13242b..4352c830cb6d2e632a225edea861b5ceb35dd035 100644
> --- a/gcc/config/arm/arm.h
> +++ b/gcc/config/arm/arm.h
> @@ -1091,6 +1091,10 @@ extern const int arm_arch_cde_coproc_bits[];
>     || (MODE) == V16QImode || (MODE) == V8HFmode || (MODE) == V4SFmode \
>     || (MODE) == V2DFmode)
>  
> +#define VALID_MVE_PRED_MODE(MODE) \
> +  ((MODE) == HImode							\
> +   || (MODE) == V16BImode || (MODE) == V8BImode || (MODE) == V4BImode)
> +
>  #define VALID_MVE_SI_MODE(MODE) \
>    ((MODE) == V2DImode ||(MODE) == V4SImode || (MODE) == V8HImode \
>     || (MODE) == V16QImode)
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index 3f171188de513e258369397e4726afe27bd9fdbf..18460ef5280be8c1df85eff424a1bf66d6019c0a 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -25564,10 +25564,7 @@ arm_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
>      return false;
>  
>    if (IS_VPR_REGNUM (regno))
> -    return mode == HImode
> -      || mode == V16BImode
> -      || mode == V8BImode
> -      || mode == V4BImode;
> +    return VALID_MVE_PRED_MODE (mode);
>  
>    if (TARGET_THUMB1)
>      /* For the Thumb we only allow values bigger than SImode in
> @@ -25646,6 +25643,10 @@ arm_modes_tieable_p (machine_mode mode1, machine_mode mode2)
>    if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2))
>      return true;
>  
> +  if (TARGET_HAVE_MVE
> +      && (VALID_MVE_PRED_MODE (mode1) && VALID_MVE_PRED_MODE (mode2)))
> +    return true;
> +
>    /* We specifically want to allow elements of "structure" modes to
>       be tieable to the structure.  This more general condition allows
>       other rarer situations too.  */
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index 7fb1e97fbea4e7b8b091f5724ebe0cb61eee7ec3..a951272186585c0a5cc3e0155285e7a635865f42 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -7652,6 +7652,22 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op,
>  	}
>      }
>  
> +  /* Try simplifying a SUBREG expression of a non-integer OUTERMODE by using a
> +     NEW_OUTERMODE of the same size instead, other simplifications rely on
> +     integer to integer subregs and we'd potentially miss out on optimizations
> +     otherwise.  */

How about:

  /* If the outer mode is not integral, try taking a subreg with the
     equivalent integer outer mode and then bitcasting the result.
     Other simplifications rely on integer to integer subregs and we'd
     potentially miss out on optimizations otherwise.  */

(Think it's easier to read as two sentences, and there's no
new_outermode in the code.)

> +  if (known_gt (GET_MODE_SIZE (innermode),
> +		GET_MODE_SIZE (outermode))
> +      && SCALAR_INT_MODE_P (innermode)
> +      && !SCALAR_INT_MODE_P (outermode)
> +      && int_mode_for_size (GET_MODE_BITSIZE (outermode),
> +			    0).exists (&int_outermode))
> +    {
> +      rtx tem = simplify_subreg (int_outermode, op, innermode, byte);
> +      if (tem)
> +	return simplify_gen_subreg (outermode, tem, GET_MODE (tem), byte);

Perhaps safer as s/GET_MODE (tem)/int_outermode/, in case TEM turns
out to be constant.

OK for the simplify-rtx.cc part with those changes.

Thanks,
Richard


> +    }
> +
>    /* If OP is a vector comparison and the subreg is not changing the
>       number of elements or the size of the elements, change the result
>       of the comparison to the new mode.  */
> diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> index 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5acf9af0a2155b5c5 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
> @@ -16,9 +16,6 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
>  **	vldrb.8	q2, \[r0\]
>  **	vldrb.8	q1, \[r1\]
>  **	vcmp.i8	eq, q2, q1
> -**	vmrs	r3, p0	@ movhi
> -**	uxth	r3, r3
> -**	vmsr	p0, r3	@ movhi
>  **	vpst
>  **	vaddt.i8	q3, q2, q1
>  **	vpst
Richard Sandiford Jan. 31, 2023, 6:15 a.m. UTC | #6
Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>> Hi,
>>
>> This patch teaches GCC that zero-extending a MVE predicate from 16-bits 
>> to 32-bits and then only using 16-bits is a no-op.
>> It does so in two steps:
>> - it lets gcc know that it can access any MVE predicate mode using any 
>> other MVE predicate mode without needing to copy it, using the 
>> TARGET_MODES_TIEABLE_P hook,
>> - it teaches simplify_subreg to optimize a subreg with a vector 
>> outermode, by replacing this outermode with a same-sized integer mode 
>> and trying the avalailable optimizations, then if successful it 
>> surrounds the result with a subreg casting it back to the original 
>> vector outermode.
>>
>> This removes the unnecessary zero-extending shown on PR 107674 (though 
>> it's a sign-extend there), that was introduced in gcc 11.
>>
>> Bootstrapped on aarch64-none-linux-gnu and regression tested on 
>> arm-none-eabi and armeb-none-eabi for armv8.1-m.main+mve.fp.
>>
>> OK for trunk?
>>
>> gcc/ChangeLog:
>>
>> 	PR target/107674
>>          * conig/arm/arm.cc (arm_hard_regno_mode_ok): Use new MACRO.
>>          (arm_modes_tieable_p): Make MVE predicate modes tieable.
>> 	* config/arm/arm.h (VALID_MVE_PRED_MODE):  New define.
>> 	* simplify-rtx.cc (simplify_context::simplify_subreg): Teach
>> 	simplify_subreg to simplify subregs where the outermode is not scalar.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.target/arm/mve/mve_vpt.c: Change to remove unecessary
>> 	zero-extend.
>>
>> diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
>> index 6f7ecf9128047647fc41677e634cd9612a13242b..4352c830cb6d2e632a225edea861b5ceb35dd035 100644
>> --- a/gcc/config/arm/arm.h
>> +++ b/gcc/config/arm/arm.h
>> @@ -1091,6 +1091,10 @@ extern const int arm_arch_cde_coproc_bits[];
>>     || (MODE) == V16QImode || (MODE) == V8HFmode || (MODE) == V4SFmode \
>>     || (MODE) == V2DFmode)
>>  
>> +#define VALID_MVE_PRED_MODE(MODE) \
>> +  ((MODE) == HImode							\
>> +   || (MODE) == V16BImode || (MODE) == V8BImode || (MODE) == V4BImode)
>> +
>>  #define VALID_MVE_SI_MODE(MODE) \
>>    ((MODE) == V2DImode ||(MODE) == V4SImode || (MODE) == V8HImode \
>>     || (MODE) == V16QImode)
>> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
>> index 3f171188de513e258369397e4726afe27bd9fdbf..18460ef5280be8c1df85eff424a1bf66d6019c0a 100644
>> --- a/gcc/config/arm/arm.cc
>> +++ b/gcc/config/arm/arm.cc
>> @@ -25564,10 +25564,7 @@ arm_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
>>      return false;
>>  
>>    if (IS_VPR_REGNUM (regno))
>> -    return mode == HImode
>> -      || mode == V16BImode
>> -      || mode == V8BImode
>> -      || mode == V4BImode;
>> +    return VALID_MVE_PRED_MODE (mode);
>>  
>>    if (TARGET_THUMB1)
>>      /* For the Thumb we only allow values bigger than SImode in
>> @@ -25646,6 +25643,10 @@ arm_modes_tieable_p (machine_mode mode1, machine_mode mode2)
>>    if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2))
>>      return true;
>>  
>> +  if (TARGET_HAVE_MVE
>> +      && (VALID_MVE_PRED_MODE (mode1) && VALID_MVE_PRED_MODE (mode2)))
>> +    return true;
>> +
>>    /* We specifically want to allow elements of "structure" modes to
>>       be tieable to the structure.  This more general condition allows
>>       other rarer situations too.  */
>> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
>> index 7fb1e97fbea4e7b8b091f5724ebe0cb61eee7ec3..a951272186585c0a5cc3e0155285e7a635865f42 100644
>> --- a/gcc/simplify-rtx.cc
>> +++ b/gcc/simplify-rtx.cc
>> @@ -7652,6 +7652,22 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op,
>>  	}
>>      }
>>  
>> +  /* Try simplifying a SUBREG expression of a non-integer OUTERMODE by using a
>> +     NEW_OUTERMODE of the same size instead, other simplifications rely on
>> +     integer to integer subregs and we'd potentially miss out on optimizations
>> +     otherwise.  */
>
> How about:
>
>   /* If the outer mode is not integral, try taking a subreg with the
>      equivalent integer outer mode and then bitcasting the result.
>      Other simplifications rely on integer to integer subregs and we'd
>      potentially miss out on optimizations otherwise.  */
>
> (Think it's easier to read as two sentences, and there's no
> new_outermode in the code.)
>
>> +  if (known_gt (GET_MODE_SIZE (innermode),
>> +		GET_MODE_SIZE (outermode))
>> +      && SCALAR_INT_MODE_P (innermode)
>> +      && !SCALAR_INT_MODE_P (outermode)
>> +      && int_mode_for_size (GET_MODE_BITSIZE (outermode),
>> +			    0).exists (&int_outermode))
>> +    {
>> +      rtx tem = simplify_subreg (int_outermode, op, innermode, byte);
>> +      if (tem)
>> +	return simplify_gen_subreg (outermode, tem, GET_MODE (tem), byte);
>
> Perhaps safer as s/GET_MODE (tem)/int_outermode/, in case TEM turns
> out to be constant.

Also, the final line should pass 0 rather than byte.  (The tem = line is OK.)

Thanks,
Richard

>
> OK for the simplify-rtx.cc part with those changes.
>
> Thanks,
> Richard
>
>
>> +    }
>> +
>>    /* If OP is a vector comparison and the subreg is not changing the
>>       number of elements or the size of the elements, change the result
>>       of the comparison to the new mode.  */
>> diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>> index 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5acf9af0a2155b5c5 100644
>> --- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>> +++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
>> @@ -16,9 +16,6 @@ void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
>>  **	vldrb.8	q2, \[r0\]
>>  **	vldrb.8	q1, \[r1\]
>>  **	vcmp.i8	eq, q2, q1
>> -**	vmrs	r3, p0	@ movhi
>> -**	uxth	r3, r3
>> -**	vmsr	p0, r3	@ movhi
>>  **	vpst
>>  **	vaddt.i8	q3, q2, q1
>>  **	vpst
diff mbox series

Patch

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 6f7ecf9128047647fc41677e634cd9612a13242b..4352c830cb6d2e632a225edea861b5ceb35dd035 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1091,6 +1091,10 @@  extern const int arm_arch_cde_coproc_bits[];
    || (MODE) == V16QImode || (MODE) == V8HFmode || (MODE) == V4SFmode \
    || (MODE) == V2DFmode)
 
+#define VALID_MVE_PRED_MODE(MODE) \
+  ((MODE) == HImode							\
+   || (MODE) == V16BImode || (MODE) == V8BImode || (MODE) == V4BImode)
+
 #define VALID_MVE_SI_MODE(MODE) \
   ((MODE) == V2DImode ||(MODE) == V4SImode || (MODE) == V8HImode \
    || (MODE) == V16QImode)
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 3f171188de513e258369397e4726afe27bd9fdbf..18460ef5280be8c1df85eff424a1bf66d6019c0a 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -25564,10 +25564,7 @@  arm_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
     return false;
 
   if (IS_VPR_REGNUM (regno))
-    return mode == HImode
-      || mode == V16BImode
-      || mode == V8BImode
-      || mode == V4BImode;
+    return VALID_MVE_PRED_MODE (mode);
 
   if (TARGET_THUMB1)
     /* For the Thumb we only allow values bigger than SImode in
@@ -25646,6 +25643,10 @@  arm_modes_tieable_p (machine_mode mode1, machine_mode mode2)
   if (GET_MODE_CLASS (mode1) == GET_MODE_CLASS (mode2))
     return true;
 
+  if (TARGET_HAVE_MVE
+      && (VALID_MVE_PRED_MODE (mode1) && VALID_MVE_PRED_MODE (mode2)))
+    return true;
+
   /* We specifically want to allow elements of "structure" modes to
      be tieable to the structure.  This more general condition allows
      other rarer situations too.  */
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 7fb1e97fbea4e7b8b091f5724ebe0cb61eee7ec3..a951272186585c0a5cc3e0155285e7a635865f42 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7652,6 +7652,22 @@  simplify_context::simplify_subreg (machine_mode outermode, rtx op,
 	}
     }
 
+  /* Try simplifying a SUBREG expression of a non-integer OUTERMODE by using a
+     NEW_OUTERMODE of the same size instead, other simplifications rely on
+     integer to integer subregs and we'd potentially miss out on optimizations
+     otherwise.  */
+  if (known_gt (GET_MODE_SIZE (innermode),
+		GET_MODE_SIZE (outermode))
+      && SCALAR_INT_MODE_P (innermode)
+      && !SCALAR_INT_MODE_P (outermode)
+      && int_mode_for_size (GET_MODE_BITSIZE (outermode),
+			    0).exists (&int_outermode))
+    {
+      rtx tem = simplify_subreg (int_outermode, op, innermode, byte);
+      if (tem)
+	return simplify_gen_subreg (outermode, tem, GET_MODE (tem), byte);
+    }
+
   /* If OP is a vector comparison and the subreg is not changing the
      number of elements or the size of the elements, change the result
      of the comparison to the new mode.  */
diff --git a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
index 26a565b79dd1348e361b3aa23a1d6e6d13bffce8..8e562a9f065eff157f63ebd5acf9af0a2155b5c5 100644
--- a/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
+++ b/gcc/testsuite/gcc.target/arm/mve/mve_vpt.c
@@ -16,9 +16,6 @@  void test0 (uint8_t *a, uint8_t *b, uint8_t *c)
 **	vldrb.8	q2, \[r0\]
 **	vldrb.8	q1, \[r1\]
 **	vcmp.i8	eq, q2, q1
-**	vmrs	r3, p0	@ movhi
-**	uxth	r3, r3
-**	vmsr	p0, r3	@ movhi
 **	vpst
 **	vaddt.i8	q3, q2, q1
 **	vpst