Message ID: 5936693E.2050005@foss.arm.com
State: New
On Tue, 6 Jun 2017, Kyrill Tkachov wrote:
> Another vec_merge simplification that's missing is transforming:
> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
> into
> (vec_concat x z) if N == 1 (0b01) or
> (vec_concat y x) if N == 2 (0b10)

Do we have a canonicalization somewhere that guarantees we cannot get
(vec_merge (vec_concat (y) (z)) (vec_duplicate x) (const_int N)) ?

I was wondering whether it would be possible to merge the transformations
for concat and duplicate into a single one, but maybe that would become
too unreadable.
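Since the mask convention in vec_merge is easy to get backwards, here is a small host-side C model of the 2-lane semantics discussed above (a sketch only; the struct and helper names are illustrative stand-ins, not GCC internals):

```c
#include <assert.h>

/* Illustrative 2-lane vector model: lane 0 is element 0, lane 1 is element 1.  */
typedef struct { long long lane[2]; } v2;

static v2 vec_duplicate (long long x)           { return (v2) { { x, x } }; }
static v2 vec_concat (long long y, long long z) { return (v2) { { y, z } }; }

/* vec_merge: lane i of the result comes from A when bit i of SEL is set,
   otherwise from B.  */
static v2 vec_merge (v2 a, v2 b, int sel)
{
  v2 r;
  for (int i = 0; i < 2; i++)
    r.lane[i] = ((sel >> i) & 1) ? a.lane[i] : b.lane[i];
  return r;
}
```

With sel == 1 (0b01) only lane 0 comes from the duplicate, so the result is { x, z }, i.e. (vec_concat x z); with sel == 2 (0b10) only lane 1 does, giving { y, x }, i.e. (vec_concat y x), matching the two cases quoted above.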
On 06/06/2017 02:35 AM, Kyrill Tkachov wrote:
> Hi all,
>
> Another vec_merge simplification that's missing is transforming:
> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
> into
> (vec_concat x z) if N == 1 (0b01) or
> (vec_concat y x) if N == 2 (0b10)
>
> For the testcase in this patch on aarch64 this allows us to try matching
> during combine the pattern:
> (set (reg:V2DI 78 [ x ])
>      (vec_concat:V2DI
>        (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64])
>        (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>                         (const_int 8 [0x8]))
>                [1 MEM[(long long int *)y_4(D) + 8B]+0 S8 A64])))
>
> rather than the more complex:
> (set (reg:V2DI 78 [ x ])
>      (vec_merge:V2DI
>        (vec_duplicate:V2DI
>          (mem:DI (plus:DI (reg/v/f:DI 76 [ y ])
>                           (const_int 8 [0x8]))
>                  [1 MEM[(long long int *)y_4(D) + 8B]+0 S8 A64]))
>        (vec_duplicate:V2DI
>          (mem:DI (reg/v/f:DI 76 [ y ]) [1 *y_4(D)+0 S8 A64]))
>        (const_int 2 [0x2])))
>
> We don't actually have an aarch64 pattern for the simplified version
> above, but it's a simple enough form to add, so this patch adds such a
> pattern that performs a concatenated load of two 64-bit vectors in
> adjacent memory locations as a single Q-register LDR.  The new aarch64
> pattern is needed to demonstrate the effectiveness of the simplify-rtx
> change, so I've kept them together as one patch.
>
> Now for the testcase in the patch we can generate:
> construct_lanedi:
>         ldr     q0, [x0]
>         ret
>
> construct_lanedf:
>         ldr     q0, [x0]
>         ret
>
> instead of:
> construct_lanedi:
>         ld1r    {v0.2d}, [x0]
>         ldr     x0, [x0, 8]
>         ins     v0.d[1], x0
>         ret
>
> construct_lanedf:
>         ld1r    {v0.2d}, [x0]
>         ldr     d1, [x0, 8]
>         ins     v0.d[1], v1.d[0]
>         ret
>
> The new memory constraint Utq is needed because we need to allow only
> the Q-register addressing modes, but the MEM expressions in the RTL
> pattern have 64-bit vector modes; if we don't constrain them they will
> allow the D-register addressing modes during register
> allocation/address mode selection, which will produce invalid assembly.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * simplify-rtx.c (simplify_ternary_operation, VEC_MERGE):
>     Simplify vec_merge of vec_duplicate and vec_concat.
>     * config/aarch64/constraints.md (Utq): New constraint.
>     * config/aarch64/aarch64-simd.md (load_pair_lanes<mode>): New
>     define_insn.
>
> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.target/aarch64/load_v2vec_lanes_1.c: New test.

OK for the simplify-rtx bits.

jeff
On 27/06/17 23:28, Jeff Law wrote:
> On 06/06/2017 02:35 AM, Kyrill Tkachov wrote:
>> Hi all,
>>
>> Another vec_merge simplification that's missing is transforming:
>> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
>> into
>> (vec_concat x z) if N == 1 (0b01) or
>> (vec_concat y x) if N == 2 (0b10)
>>
>> [rest of the patch description and ChangeLog snipped]
> OK for the simplify-rtx bits.

Thanks Jeff.
I'd like to ping the aarch64 bits:
https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00273.html

I've re-bootstrapped and re-tested these patches on aarch64 with
today's trunk.

Kyrill

> jeff
On 05/07/17 16:14, Kyrill Tkachov wrote:
> On 27/06/17 23:28, Jeff Law wrote:
>> On 06/06/2017 02:35 AM, Kyrill Tkachov wrote:
>>> Hi all,
>>>
>>> Another vec_merge simplification that's missing is transforming:
>>> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
>>> into
>>> (vec_concat x z) if N == 1 (0b01) or
>>> (vec_concat y x) if N == 2 (0b10)
>>>
>>> [rest of the patch description and ChangeLog snipped]
>> OK for the simplify-rtx bits.
>
> Thanks Jeff.
> I'd like to ping the aarch64 bits:
> https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00273.html

Ping.

Thanks,
Kyrill

> I've re-bootstrapped and re-tested these patches on aarch64 with
> today's trunk.
>
> Kyrill
>
>> jeff
On Tue, Jun 06, 2017 at 09:35:10AM +0100, Kyrill Tkachov wrote:
> Hi all,
>
> Another vec_merge simplification that's missing is transforming:
> (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
> into
> (vec_concat x z) if N == 1 (0b01) or
> (vec_concat y x) if N == 2 (0b10)
>
> [rest of the patch description snipped]
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?

I'd appreciate a more thorough set of tests, looking at some of the
vector mode combines that you now permit.  I'm (only a little) nervous
that you only test the DI/DF-mode cases here, and that there is no
testing for V2SI etc., no tests for strict alignment, and no tests for
cases where the addresses must block the transformation.

The patch itself looks OK, but it could do with better tests.

Thanks,
James
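To make the request for extra coverage concrete, here is a sketch of what a companion test for the 64-bit vector element modes might look like. Everything in it is hypothetical: the file name, the function names, and in particular the expectation that combine forms the same concatenated load for V2SI/V2SF are my assumptions, not part of the posted patch, and would need verifying against an aarch64 toolchain.

```c
/* Hypothetical gcc.target/aarch64/load_v2vec_lanes_2.c -- not from the patch.  */
/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef int v2si __attribute__ ((vector_size (8)));
typedef int v4si __attribute__ ((vector_size (16)));
typedef float v2sf __attribute__ ((vector_size (8)));
typedef float v4sf __attribute__ ((vector_size (16)));

v4si
construct_lanesi (v2si *y)
{
  /* Concatenate two adjacent V2SI loads into one V4SI.  */
  v4si x = { y[0][0], y[0][1], y[1][0], y[1][1] };
  return x;
}

v4sf
construct_lanesf (v2sf *y)
{
  /* Likewise for V2SF into V4SF.  */
  v4sf x = { y[0][0], y[0][1], y[1][0], y[1][1] };
  return x;
}

/* If load_pair_lanes<mode> fires for these modes too, each function
   should become a single Q-register LDR.  */
/* { dg-final { scan-assembler-times "ldr\tq\[0-9\]+" 2 } } */
/* { dg-final { scan-assembler-not "ins\t" } } */
```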
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 77a3a7d6534e5fd3575e33d5a7c607713abd614b..b78affe9b06ffc973888822a4fcf1ec8e80ecdf6 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2803,6 +2803,20 @@ (define_insn "aarch64_get_lane<mode>"
   [(set_attr "type" "neon_to_gp<q>, neon_dup<q>, neon_store1_one_lane<q>")]
 )
 
+(define_insn "load_pair_lanes<mode>"
+  [(set (match_operand:<VDBL> 0 "register_operand" "=w")
+	(vec_concat:<VDBL>
+	   (match_operand:VDC 1 "memory_operand" "Utq")
+	   (match_operand:VDC 2 "memory_operand" "m")))]
+  "TARGET_SIMD && !STRICT_ALIGNMENT
+   && rtx_equal_p (XEXP (operands[2], 0),
+		   plus_constant (Pmode,
+				  XEXP (operands[1], 0),
+				  GET_MODE_SIZE (<MODE>mode)))"
+  "ldr\\t%q0, %1"
+  [(set_attr "type" "neon_load1_1reg_q")]
+)
+
 ;; In this insn, operand 1 should be low, and operand 2 the high part of the
 ;; dest vector.
 
diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index b8293376fde7e03c4cfc2a6ad6268201f487eb92..ab607b9f7488e903a14fe93e88d4c4e1fad762b3 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -161,6 +161,13 @@ (define_memory_constraint "Utv"
   (and (match_code "mem")
        (match_test "aarch64_simd_mem_operand_p (op)")))
 
+(define_memory_constraint "Utq"
+  "@internal
+   An address valid for loading or storing a 128-bit AdvSIMD register"
+  (and (match_code "mem")
+       (match_test "aarch64_legitimate_address_p (V2DImode, XEXP (op, 0),
+						  MEM, 1)")))
+
 (define_constraint "Ufc"
  "A floating point constant which can be used with an\
  FMOV immediate operation."
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 42824b6c61af37f6b005de75bd1e5ebe7522bdba..a4aebae68afc14a69870e1fd280d28251aa5f398 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5701,6 +5701,25 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 	      std::swap (newop0, newop1);
 	    return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	  }
+	/* Replace (vec_merge (vec_duplicate x) (vec_concat (y) (z)) (const_int N))
+	   with (vec_concat x z) if N == 1, or (vec_concat y x) if N == 2.
+	   Only applies for vectors of two elements.  */
+	if (GET_CODE (op0) == VEC_DUPLICATE
+	    && GET_CODE (op1) == VEC_CONCAT
+	    && GET_MODE_NUNITS (GET_MODE (op0)) == 2
+	    && GET_MODE_NUNITS (GET_MODE (op1)) == 2
+	    && IN_RANGE (sel, 1, 2))
+	  {
+	    rtx newop0 = XEXP (op0, 0);
+	    rtx newop1 = XEXP (op1, 2 - sel);
+	    rtx otherop = XEXP (op1, sel - 1);
+	    if (sel == 2)
+	      std::swap (newop0, newop1);
+	    /* Don't want to throw away the other part of the vec_concat if
+	       it has side-effects.  */
+	    if (!side_effects_p (otherop))
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	  }
       }
 
     if (rtx_equal_p (op0, op1)
diff --git a/gcc/testsuite/gcc.target/aarch64/load_v2vec_lanes_1.c b/gcc/testsuite/gcc.target/aarch64/load_v2vec_lanes_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..3c31b340154b5469fca858a579e9a6ab90ee0d22
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/load_v2vec_lanes_1.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef long long v2di __attribute__ ((vector_size (16)));
+typedef double v2df __attribute__ ((vector_size (16)));
+
+v2di
+construct_lanedi (long long *y)
+{
+  v2di x = { y[0], y[1] };
+  return x;
+}
+
+v2df
+construct_lanedf (double *y)
+{
+  v2df x = { y[0], y[1] };
+  return x;
+}
+
+/* We can use the load_pair_lanes<mode> pattern to vec_concat two DI/DF
+   values from consecutive memory into a 2-element vector by using
+   a Q-reg LDR.  */
+
+/* { dg-final { scan-assembler-times "ldr\tq\[0-9\]+" 2 } } */
+/* { dg-final { scan-assembler-not "ins\t" } } */
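For completeness, a host-side sketch of the equivalence the load_pair_lanes<mode> pattern relies on: two adjacent 64-bit loads, concatenated lane by lane, yield the same value as one 128-bit load of the starting address. This models only the little-endian lane layout that "ldr %q0, %1" assumes; the names are illustrative and are not GCC or testsuite code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for a 128-bit Q register holding two 64-bit lanes.  */
typedef struct { uint64_t lane[2]; } v2u64;

/* Two D-sized loads from adjacent addresses, concatenated lane-wise.  */
static v2u64 concat_two_loads (const uint64_t *p)
{
  return (v2u64) { { p[0], p[1] } };
}

/* One Q-sized load of the same 16 bytes.  */
static v2u64 one_q_load (const uint64_t *p)
{
  v2u64 r;
  memcpy (&r, p, sizeof r);
  return r;
}
```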