
vec_merge + vec_duplicate + vec_concat simplification

Message ID 5936693E.2050005@foss.arm.com
State New

Commit Message

Kyrill Tkachov June 6, 2017, 8:35 a.m. UTC
Hi all,

Another vec_merge simplification that's missing is transforming:
(vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
into
(vec_concat x z) if N == 1 (0b01) or
(vec_concat y x) if N == 2 (0b10)
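
For example (an illustrative instance, not taken from the patch), with two-element vectors:

(vec_merge:V2DI (vec_duplicate:V2DI (reg:DI 1))
    (vec_concat:V2DI (reg:DI 2) (reg:DI 3))
    (const_int 1))

simplifies to (vec_concat:V2DI (reg:DI 1) (reg:DI 3)): bit 0 of the selector takes lane 0 from the
vec_duplicate, and the remaining lane comes from the vec_concat.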

For the testcase in this patch on aarch64 this allows us to try matching the following pattern during combine:
(set (reg:V2DI 78 [ x ])
     (vec_concat:V2DI
         (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64])
         (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
                 (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) + 8B]+0 S8 A64])))

rather than the more complex:
(set (reg:V2DI 78 [ x ])
     (vec_merge:V2DI (vec_duplicate:V2DI (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
                     (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) + 8B]+0 S8 A64]))
         (vec_duplicate:V2DI (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64]))
         (const_int 2 [0x2])))

We don't actually have an aarch64 pattern for the simplified form above, but it's simple enough
to add, so this patch adds such a pattern: it performs a concatenated load of two 64-bit vectors
from adjacent memory locations as a single Q-register LDR. The new aarch64 pattern is needed to
demonstrate the effectiveness of the simplify-rtx change, so I've kept them together as one patch.
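
For reference, an intrinsics-level analogue of what the new pattern implements would look roughly
like this (purely illustrative; the function name is made up, whether this exact form gets rewritten
depends on how the intrinsics expand, and the testcase below uses GNU vector initializers instead):

#include <arm_neon.h>

int64x2_t
load_pair (const int64_t *y)
{
  /* Two loads from adjacent 64-bit locations combined into one
     128-bit vector, ideally emitted as a single "ldr q0, [x0]".  */
  return vcombine_s64 (vld1_s64 (y), vld1_s64 (y + 1));
}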

Now for the testcase in the patch we can generate:
construct_lanedi:
         ldr     q0, [x0]
         ret

construct_lanedf:
         ldr     q0, [x0]
         ret

instead of:
construct_lanedi:
         ld1r    {v0.2d}, [x0]
         ldr     x0, [x0, 8]
         ins     v0.d[1], x0
         ret

construct_lanedf:
         ld1r    {v0.2d}, [x0]
         ldr     d1, [x0, 8]
         ins     v0.d[1], v1.d[0]
         ret

The new memory constraint Utq is needed because we must allow only the Q-register addressing modes:
the MEM expressions in the RTL pattern have 64-bit vector modes, and if we don't constrain them they
will also accept the D-register addressing modes during register allocation/address mode selection,
which produces invalid assembly (for example, a scaled offset such as #264 is encodable in a 64-bit
LDR but not in a 128-bit one).

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for trunk?

Thanks,
Kyrill

2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * simplify-rtx.c (simplify_ternary_operation, VEC_MERGE):
     Simplify vec_merge of vec_duplicate and vec_concat.
     * config/aarch64/constraints.md (Utq): New constraint.
     * config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): New
     define_insn.

2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/load_v2vec_lanes_1.c: New test.

Comments

Marc Glisse June 6, 2017, 9:40 p.m. UTC | #1
On Tue, 6 Jun 2017, Kyrill Tkachov wrote:

> Another vec_merge simplification that's missing is transforming:
> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
> into
> (vec_concat x z) if N == 1 (0b01) or
> (vec_concat y x) if N == 2 (0b10)

Do we have a canonicalization somewhere that guarantees we cannot get
(vec_merge (vec_concat (y) (z)) (vec_duplicate x) (const_int N))
?

I was wondering if it would be possible to merge the transformations for 
concat and duplicate into a single one, but maybe it becomes too 
unreadable.
Jeff Law June 27, 2017, 10:28 p.m. UTC | #2
On 06/06/2017 02:35 AM, Kyrill Tkachov wrote:
> Hi all,
> 
> Another vec_merge simplification that's missing is transforming:
> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
> into
> (vec_concat x z) if N == 1 (0b01) or
> (vec_concat y x) if N == 2 (0b10)
> 
> For the testcase in this patch on aarch64 this allows us to try matching
> during combine the pattern:
> (set (reg:V2DI 78 [ x ])
>     (vec_concat:V2DI
>         (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64])
>         (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>                 (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) +
> 8B]+0 S8 A64])))
> 
> rather than the more complex:
> (set (reg:V2DI 78 [ x ])
>     (vec_merge:V2DI (vec_duplicate:V2DI (mem:DI (plus:DI (reg/v/f:DI 76
> [ y ])
>                     (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D)
> + 8B]+0 S8 A64]))
>         (vec_duplicate:V2DI (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0
> S8 A64]))
>         (const_int 2 [0x2])))
> 
> We don't actually have an aarch64 pattern for the simplified version
> above, but it's a simple enough
> form to add, so this patch adds such a pattern that performs a
> concatenated load of two 64-bit vectors
> in adjacent memory locations as a single Q-register LDR. The new aarch64
> pattern is needed to demonstrate
> the effectiveness of the simplify-rtx change, so I've kept them together
> as one patch.
> 
> Now for the testcase in the patch we can generate:
> construct_lanedi:
>         ldr     q0, [x0]
>         ret
> 
> construct_lanedf:
>         ldr     q0, [x0]
>         ret
> 
> instead of:
> construct_lanedi:
>         ld1r    {v0.2d}, [x0]
>         ldr     x0, [x0, 8]
>         ins     v0.d[1], x0
>         ret
> 
> construct_lanedf:
>         ld1r    {v0.2d}, [x0]
>         ldr     d1, [x0, 8]
>         ins     v0.d[1], v1.d[0]
>         ret
> 
> The new memory constraint Utq is needed because we need to allow only
> the Q-register addressing modes but
> the MEM expressions in the RTL pattern have 64-bit vector modes, and if
> we don't constrain them they will
> allow the D-register addressing modes during register allocation/address
> mode selection, which will produce
> invalid assembly.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> 
> Thanks,
> Kyrill
> 
> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * simplify-rtx.c (simplify_ternary_operation, VEC_MERGE):
>     Simplify vec_merge of vec_duplicate and vec_concat.
>     * config/aarch64/constraints.md (Utq): New constraint.
>     * config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): New
>     define_insn.
> 
> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * gcc.target/aarch64/load_v2vec_lanes_1.c: New test.
OK for the simplify-rtx bits.

jeff
Kyrill Tkachov July 5, 2017, 3:14 p.m. UTC | #3
On 27/06/17 23:28, Jeff Law wrote:
> On 06/06/2017 02:35 AM, Kyrill Tkachov wrote:
>> Hi all,
>>
>> Another vec_merge simplification that's missing is transforming:
>> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
>> into
>> (vec_concat x z) if N == 1 (0b01) or
>> (vec_concat y x) if N == 2 (0b10)
>>
>> For the testcase in this patch on aarch64 this allows us to try matching
>> during combine the pattern:
>> (set (reg:V2DI 78 [ x ])
>>      (vec_concat:V2DI
>>          (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64])
>>          (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>>                  (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) +
>> 8B]+0 S8 A64])))
>>
>> rather than the more complex:
>> (set (reg:V2DI 78 [ x ])
>>      (vec_merge:V2DI (vec_duplicate:V2DI (mem:DI (plus:DI (reg/v/f:DI 76
>> [ y ])
>>                      (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D)
>> + 8B]+0 S8 A64]))
>>          (vec_duplicate:V2DI (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0
>> S8 A64]))
>>          (const_int 2 [0x2])))
>>
>> We don't actually have an aarch64 pattern for the simplified version
>> above, but it's a simple enough
>> form to add, so this patch adds such a pattern that performs a
>> concatenated load of two 64-bit vectors
>> in adjacent memory locations as a single Q-register LDR. The new aarch64
>> pattern is needed to demonstrate
>> the effectiveness of the simplify-rtx change, so I've kept them together
>> as one patch.
>>
>> Now for the testcase in the patch we can generate:
>> construct_lanedi:
>>          ldr     q0, [x0]
>>          ret
>>
>> construct_lanedf:
>>          ldr     q0, [x0]
>>          ret
>>
>> instead of:
>> construct_lanedi:
>>          ld1r    {v0.2d}, [x0]
>>          ldr     x0, [x0, 8]
>>          ins     v0.d[1], x0
>>          ret
>>
>> construct_lanedf:
>>          ld1r    {v0.2d}, [x0]
>>          ldr     d1, [x0, 8]
>>          ins     v0.d[1], v1.d[0]
>>          ret
>>
>> The new memory constraint Utq is needed because we need to allow only
>> the Q-register addressing modes but
>> the MEM expressions in the RTL pattern have 64-bit vector modes, and if
>> we don't constrain them they will
>> allow the D-register addressing modes during register allocation/address
>> mode selection, which will produce
>> invalid assembly.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for trunk?
>>
>> Thanks,
>> Kyrill
>>
>> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * simplify-rtx.c (simplify_ternary_operation, VEC_MERGE):
>>      Simplify vec_merge of vec_duplicate and vec_concat.
>>      * config/aarch64/constraints.md (Utq): New constraint.
>>      * config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): New
>>      define_insn.
>>
>> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>
>>      * gcc.target/aarch64/load_v2vec_lanes_1.c: New test.
> OK for the simplify-rtx bits.

Thanks Jeff.
I'd like to ping the aarch64 bits:
https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00273.html

I've re-bootstrapped and re-tested these patches on aarch64 with today's trunk.

Kyrill

> jeff
>
Kyrill Tkachov July 18, 2017, 8:54 a.m. UTC | #4
On 05/07/17 16:14, Kyrill Tkachov wrote:
>
> On 27/06/17 23:28, Jeff Law wrote:
>> On 06/06/2017 02:35 AM, Kyrill Tkachov wrote:
>>> Hi all,
>>>
>>> Another vec_merge simplification that's missing is transforming:
>>> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
>>> into
>>> (vec_concat x z) if N == 1 (0b01) or
>>> (vec_concat y x) if N == 2 (0b10)
>>>
>>> For the testcase in this patch on aarch64 this allows us to try matching
>>> during combine the pattern:
>>> (set (reg:V2DI 78 [ x ])
>>>      (vec_concat:V2DI
>>>          (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64])
>>>          (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>>>                  (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) +
>>> 8B]+0 S8 A64])))
>>>
>>> rather than the more complex:
>>> (set (reg:V2DI 78 [ x ])
>>>      (vec_merge:V2DI (vec_duplicate:V2DI (mem:DI (plus:DI (reg/v/f:DI 76
>>> [ y ])
>>>                      (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D)
>>> + 8B]+0 S8 A64]))
>>>          (vec_duplicate:V2DI (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0
>>> S8 A64]))
>>>          (const_int 2 [0x2])))
>>>
>>> We don't actually have an aarch64 pattern for the simplified version
>>> above, but it's a simple enough
>>> form to add, so this patch adds such a pattern that performs a
>>> concatenated load of two 64-bit vectors
>>> in adjacent memory locations as a single Q-register LDR. The new aarch64
>>> pattern is needed to demonstrate
>>> the effectiveness of the simplify-rtx change, so I've kept them together
>>> as one patch.
>>>
>>> Now for the testcase in the patch we can generate:
>>> construct_lanedi:
>>>          ldr     q0, [x0]
>>>          ret
>>>
>>> construct_lanedf:
>>>          ldr     q0, [x0]
>>>          ret
>>>
>>> instead of:
>>> construct_lanedi:
>>>          ld1r    {v0.2d}, [x0]
>>>          ldr     x0, [x0, 8]
>>>          ins     v0.d[1], x0
>>>          ret
>>>
>>> construct_lanedf:
>>>          ld1r    {v0.2d}, [x0]
>>>          ldr     d1, [x0, 8]
>>>          ins     v0.d[1], v1.d[0]
>>>          ret
>>>
>>> The new memory constraint Utq is needed because we need to allow only
>>> the Q-register addressing modes but
>>> the MEM expressions in the RTL pattern have 64-bit vector modes, and if
>>> we don't constrain them they will
>>> allow the D-register addressing modes during register allocation/address
>>> mode selection, which will produce
>>> invalid assembly.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * simplify-rtx.c (simplify_ternary_operation, VEC_MERGE):
>>>      Simplify vec_merge of vec_duplicate and vec_concat.
>>>      * config/aarch64/constraints.md (Utq): New constraint.
>>>      * config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): New
>>>      define_insn.
>>>
>>> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>
>>>      * gcc.target/aarch64/load_v2vec_lanes_1.c: New test.
>> OK for the simplify-rtx bits.
>
> Thanks Jeff.
> I'd like to ping the aarch64 bits:
> https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00273.html
>

Ping.

Thanks,
Kyrill

> I've re-bootstrapped and re-tested these patches on aarch64 with today's trunk.
>
> Kyrill
>
>> jeff
>>
>
James Greenhalgh July 24, 2017, 10:45 a.m. UTC | #5
On Tue, Jun 06, 2017 at 09:35:10AM +0100, Kyrill Tkachov wrote:
> Hi all,
> 
> Another vec_merge simplification that's missing is transforming:
> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
> into
> (vec_concat x z) if N == 1 (0b01) or
> (vec_concat y x) if N == 2 (0b10)
> 
> For the testcase in this patch on aarch64 this allows us to try matching during combine the pattern:
> (set (reg:V2DI 78 [ x ])
>     (vec_concat:V2DI
>         (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64])
>         (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>                 (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) + 8B]+0 S8 A64])))
> 
> rather than the more complex:
> (set (reg:V2DI 78 [ x ])
>     (vec_merge:V2DI (vec_duplicate:V2DI (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>                     (const_int 8 [0x8])) [1 MEM[(long long int *)y_4(D) + 8B]+0 S8 A64]))
>         (vec_duplicate:V2DI (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64]))
>         (const_int 2 [0x2])))
> 
> We don't actually have an aarch64 pattern for the simplified version above, but it's a simple enough
> form to add, so this patch adds such a pattern that performs a concatenated load of two 64-bit vectors
> in adjacent memory locations as a single Q-register LDR. The new aarch64 pattern is needed to demonstrate
> the effectiveness of the simplify-rtx change, so I've kept them together as one patch.
> 
> Now for the testcase in the patch we can generate:
> construct_lanedi:
>         ldr     q0, [x0]
>         ret
> 
> construct_lanedf:
>         ldr     q0, [x0]
>         ret
> 
> instead of:
> construct_lanedi:
>         ld1r    {v0.2d}, [x0]
>         ldr     x0, [x0, 8]
>         ins     v0.d[1], x0
>         ret
> 
> construct_lanedf:
>         ld1r    {v0.2d}, [x0]
>         ldr     d1, [x0, 8]
>         ins     v0.d[1], v1.d[0]
>         ret
> 
> The new memory constraint Utq is needed because we need to allow only the Q-register addressing modes but
> the MEM expressions in the RTL pattern have 64-bit vector modes, and if we don't constrain them they will
> allow the D-register addressing modes during register allocation/address mode selection, which will produce
> invalid assembly.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?

I'd appreciate a more thorough set of tests, looking at some of the vector
mode combines that you now permit. I'm (only a little) nervous that you only
test here for the DI/DFmode cases and that there is no testing for
V2SI etc., nor tests for strict align, nor tests for when the addresses
must block the transformation.
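
For instance, a negative test along these lines (an illustrative sketch only; the function name is
made up) would cover the case where the addresses block the transformation:

typedef long long v2di __attribute__ ((vector_size (16)));

v2di
construct_gap_lanedi (long long *y)
{
  /* y[0] and y[2] are not adjacent, so the load_pair_lanes<mode>
     pattern must not match and no single Q-register LDR should be
     used to build the vector.  */
  v2di x = { y[0], y[2] };
  return x;
}

with analogous variants for the other VDC modes and for -mstrict-align.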

The patch itself looks OK, but it could do with better tests.

Thanks,
James

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 77a3a7d6534e5fd3575e33d5a7c607713abd614b..b78affe9b06ffc973888822a4fcf1ec8e80ecdf6 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2803,6 +2803,20 @@  (define_insn "aarch64_get_lane<mode>"
   [(set_attr "type" "neon_to_gp<q>, neon_dup<q>, neon_store1_one_lane<q>")]
 )
 
+(define_insn "load_pair_lanes<mode>"
+  [(set (match_operand:<VDBL> 0 "register_operand" "=w")
+	(vec_concat:<VDBL>
+	   (match_operand:VDC 1 "memory_operand" "Utq")
+	   (match_operand:VDC 2 "memory_operand" "m")))]
+  "TARGET_SIMD && !STRICT_ALIGNMENT
+   && rtx_equal_p (XEXP (operands[2], 0),
+		   plus_constant (Pmode,
+				  XEXP (operands[1], 0),
+				  GET_MODE_SIZE (<MODE>mode)))"
+  "ldr\\t%q0, %1"
+  [(set_attr "type" "neon_load1_1reg_q")]
+)
+
 ;; In this insn, operand 1 should be low, and operand 2 the high part of the
 ;; dest vector.
 
diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index b8293376fde7e03c4cfc2a6ad6268201f487eb92..ab607b9f7488e903a14fe93e88d4c4e1fad762b3 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -161,6 +161,13 @@  (define_memory_constraint "Utv"
   (and (match_code "mem")
        (match_test "aarch64_simd_mem_operand_p (op)")))
 
+(define_memory_constraint "Utq"
+  "@internal
+   An address valid for loading or storing a 128-bit AdvSIMD register"
+  (and (match_code "mem")
+       (match_test "aarch64_legitimate_address_p (V2DImode, XEXP (op, 0),
+						  MEM, 1)")))
+
 (define_constraint "Ufc"
   "A floating point constant which can be used with an\
    FMOV immediate operation."
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 42824b6c61af37f6b005de75bd1e5ebe7522bdba..a4aebae68afc14a69870e1fd280d28251aa5f398 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5701,6 +5701,25 @@  simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 		std::swap (newop0, newop1);
 	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
+	  /* Replace (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
+	     with (vec_concat x z) if N == 1, or (vec_concat y x) if N == 2.
+	     Only applies for vectors of two elements.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE
+	      && GET_CODE (op1) == VEC_CONCAT
+	      && GET_MODE_NUNITS (GET_MODE (op0)) == 2
+	      && GET_MODE_NUNITS (GET_MODE (op1)) == 2
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = XEXP (op0, 0);
+	      rtx newop1 = XEXP (op1, 2 - sel);
+	      rtx otherop = XEXP (op1, sel - 1);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+	      /* Don't want to throw away the other part of the vec_concat if
+		 it has side-effects.  */
+	      if (!side_effects_p (otherop))
+		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
 	}
 
       if (rtx_equal_p (op0, op1)
diff --git a/gcc/testsuite/gcc.target/aarch64/load_v2vec_lanes_1.c b/gcc/testsuite/gcc.target/aarch64/load_v2vec_lanes_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..3c31b340154b5469fca858a579e9a6ab90ee0d22
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/load_v2vec_lanes_1.c
@@ -0,0 +1,26 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef long long v2di __attribute__ ((vector_size (16)));
+typedef double v2df __attribute__ ((vector_size (16)));
+
+v2di
+construct_lanedi (long long *y)
+{
+  v2di x = { y[0], y[1] };
+  return x;
+}
+
+v2df
+construct_lanedf (double *y)
+{
+  v2df x = { y[0], y[1] };
+  return x;
+}
+
+/* We can use the load_pair_lanes<mode> pattern to vec_concat two DI/DF
+   values from consecutive memory into a 2-element vector by using
+   a Q-reg LDR.  */
+
+/* { dg-final { scan-assembler-times "ldr\tq\[0-9\]+" 2 } } */
+/* { dg-final { scan-assembler-not "ins\t" } } */