Message ID | alpine.DEB.2.02.1204171927240.5127@laptop-mg.saclay.inria.fr |
---|---|
State | New |
Ping?

http://gcc.gnu.org/ml/gcc-patches/2012-04/msg01034.html

Since then, I've run a c,c++ bootstrap and:
make -k check RUNTESTFLAGS="--target_board=my-sde-sim"
where my-sde-sim is the dejagnu board posted by H.J. Lu to run tests inside
Intel's simulator, no difference between before and after my patch.
(If I understand correctly, the testsuite always compiles the AVX and AVX2
tests, and uses cpuid (which I expect the simulator must fake) to determine
if it should run them, so I don't need to pass any extra flag in
RUNTESTFLAGS. If I am wrong, please tell me.)

Adding in Cc: the 2 people who kindly commented on the other shuffle patch
(the one that isn't finished).

On Tue, 17 Apr 2012, Marc Glisse wrote:

> Hello,
>
> this patch expands __builtin_shuffle for V4DF mode in at most 3 insn. It is
> simple and works really well, often generates only 2 insn. It is not very
> generic, because other modes don't have an instruction equivalent to vshufpd.
> For V8SF (and likely V4DI and V8SI with AVX2, but I still need to do that),
> my patch "default case" in PR 52607 seems more interesting.
>
> I tried calling this new function after expand_vec_perm_vperm2f128_vblend
> (instead of before as in the patch), but it generated more instructions for
> some permutations, and never less. That function is still useful for V8SF
> though.
>
> I bootstrapped gcc on a non-avx platform, compiled a program that tests all
> 4096 shuffles with -mavx/-mavx2, and ran the result using Intel's emulator
> (SDE).
>
> There are still a few V4DF permutations that don't generate an optimal
> sequence (3 insn instead of 2), but not that many I think. Of course, I am
> assuming a constant cost of 1 per insn, which is completely false, but seems
> like a sensible first approximation.
>
> (note that I can't commit)
>
>
> 2012-04-17  Marc Glisse  <marc.glisse@inria.fr>
>
> 	PR target/52607
> 	* config/i386/i386.c (ix86_expand_vec_perm_const): Move code to ...
> 	(canonicalize_perm): ... new function.
> 	(expand_vec_perm_2vperm2f128_vshuf): New function.
> 	(ix86_expand_vec_perm_const_1): Call it.
Any comment?

On Mon, 30 Apr 2012, Marc Glisse wrote:

> Ping?
>
> http://gcc.gnu.org/ml/gcc-patches/2012-04/msg01034.html
>
> Since then, I've run a c,c++ bootstrap and:
> make -k check RUNTESTFLAGS="--target_board=my-sde-sim"
> where my-sde-sim is the dejagnu board posted by H.J. Lu to run tests inside
> Intel's simulator, no difference between before and after my patch.
> (If I understand correctly, the testsuite always compiles the AVX and AVX2
> tests, and uses cpuid (which I expect the simulator must fake) to determine
> if it should run them, so I don't need to pass any extra flag in
> RUNTESTFLAGS. If I am wrong, please tell me.)
>
> Adding in Cc: the 2 people who kindly commented on the other shuffle patch
> (the one that isn't finished).
>
> On Tue, 17 Apr 2012, Marc Glisse wrote:
>
>> Hello,
>>
>> this patch expands __builtin_shuffle for V4DF mode in at most 3 insn. It is
>> simple and works really well, often generates only 2 insn. It is not very
>> generic, because other modes don't have an instruction equivalent to
>> vshufpd. For V8SF (and likely V4DI and V8SI with AVX2, but I still need to
>> do that), my patch "default case" in PR 52607 seems more interesting.
>>
>> I tried calling this new function after expand_vec_perm_vperm2f128_vblend
>> (instead of before as in the patch), but it generated more instructions for
>> some permutations, and never less. That function is still useful for V8SF
>> though.
>>
>> I bootstrapped gcc on a non-avx platform, compiled a program that tests all
>> 4096 shuffles with -mavx/-mavx2, and ran the result using Intel's emulator
>> (SDE).
>>
>> There are still a few V4DF permutations that don't generate an optimal
>> sequence (3 insn instead of 2), but not that many I think. Of course, I am
>> assuming a constant cost of 1 per insn, which is completely false, but
>> seems like a sensible first approximation.
>>
>> (note that I can't commit)
>>
>>
>> 2012-04-17  Marc Glisse  <marc.glisse@inria.fr>
>>
>> 	PR target/52607
>> 	* config/i386/i386.c (ix86_expand_vec_perm_const): Move code to ...
>> 	(canonicalize_perm): ... new function.
>> 	(expand_vec_perm_2vperm2f128_vshuf): New function.
>> 	(ix86_expand_vec_perm_const_1): Call it.
On 04/17/12 11:03, Marc Glisse wrote:
> 2012-04-17  Marc Glisse  <marc.glisse@inria.fr>
>
> 	PR target/52607
> 	* config/i386/i386.c (ix86_expand_vec_perm_const): Move code to ...
> 	(canonicalize_perm): ... new function.
> 	(expand_vec_perm_2vperm2f128_vshuf): New function.
> 	(ix86_expand_vec_perm_const_1): Call it.

Looks good.

r~
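
The index arithmetic behind expand_vec_perm_2vperm2f128_vshuf can be checked outside the compiler. What follows is a minimal sketch in plain C, not GCC code. It assumes a vector can be modeled as an array of 4 doubles and a two-operand selector as 4 indices in the range 0..7. The names apply_perm, decompose and count_mismatches are hypothetical helpers introduced only for this illustration. The sketch splits a selector into two pair-selecting permutations and a final interleaving permutation, then compares the composed result with applying the selector directly, for all 8^4 = 4096 selectors. It only checks that the composition is correct. It does not model whether each step maps to a single vperm2f128 or vshufpd instruction.

```c
#include <assert.h>
#include <stddef.h>

#define NELT 4 /* V4DF: four doubles per vector */

/* Reference semantics of a two-operand permutation: indices 0..3 select
   from op0, indices 4..7 select from op1.  OUT must not alias the inputs.  */
void
apply_perm (const double *op0, const double *op1,
            const unsigned char *perm, double *out)
{
  for (size_t i = 0; i < NELT; ++i)
    {
      assert (perm[i] < 2 * NELT);
      out[i] = perm[i] < NELT ? op0[perm[i]] : op1[perm[i] - NELT];
    }
}

/* Hypothetical helper: the same index computation as in the patch.
   FIRST and SECOND each pick aligned pairs of elements; THIRD picks one
   element of each pair, alternating between the two temporaries.  */
void
decompose (const unsigned char *perm, unsigned char *first,
           unsigned char *second, unsigned char *third)
{
  first[0] = (unsigned char) (perm[0] & ~1);
  first[1] = (unsigned char) ((perm[0] & ~1) + 1);
  first[2] = (unsigned char) (perm[2] & ~1);
  first[3] = (unsigned char) ((perm[2] & ~1) + 1);
  second[0] = (unsigned char) (perm[1] & ~1);
  second[1] = (unsigned char) ((perm[1] & ~1) + 1);
  second[2] = (unsigned char) (perm[3] & ~1);
  second[3] = (unsigned char) ((perm[3] & ~1) + 1);
  third[0] = (unsigned char) (perm[0] % 2);
  third[1] = (unsigned char) (perm[1] % 2 + 4);
  third[2] = (unsigned char) (perm[2] % 2 + 2);
  third[3] = (unsigned char) (perm[3] % 2 + 6);
}

/* Try every selector and count those where the three-step result differs
   from the direct permutation.  */
int
count_mismatches (void)
{
  const double op0[NELT] = { 0, 1, 2, 3 };
  const double op1[NELT] = { 4, 5, 6, 7 };
  int bad = 0;

  for (unsigned s = 0; s < 4096; ++s)
    {
      unsigned char perm[NELT], first[NELT], second[NELT], third[NELT];
      double want[NELT], t1[NELT], t2[NELT], got[NELT];

      for (size_t i = 0; i < NELT; ++i)
        perm[i] = (unsigned char) ((s >> (3 * i)) & 7);

      apply_perm (op0, op1, perm, want);
      decompose (perm, first, second, third);
      apply_perm (op0, op1, first, t1);
      apply_perm (op0, op1, second, t2);
      apply_perm (t1, t2, third, got);

      for (size_t i = 0; i < NELT; ++i)
        if (want[i] != got[i])
          {
            ++bad;
            break;
          }
    }
  return bad;
}
```

Calling count_mismatches () from a small driver is enough to exercise the model. The selector enumeration packs three bits per element, which is where the figure of 4096 shuffles comes from.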
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 186523)
+++ config/i386/i386.c	(working copy)
@@ -32946,6 +32946,7 @@
   bool testing_p;
 };
 
+static bool canonicalize_perm (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_1 (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_broadcast_1 (struct expand_vec_perm_d *d);
 
@@ -37003,6 +37004,57 @@
   return true;
 }
 
+/* A subroutine of ix86_expand_vec_perm_builtin_1.  Implement a V4DF
+   permutation using two vperm2f128, followed by a vshufpd insn blending
+   the two vectors together.  */
+
+static bool
+expand_vec_perm_2vperm2f128_vshuf (struct expand_vec_perm_d *d)
+{
+  struct expand_vec_perm_d dfirst, dsecond, dthird;
+  bool ok;
+
+  if (!TARGET_AVX || (d->vmode != V4DFmode))
+    return false;
+
+  if (d->testing_p)
+    return true;
+
+  dfirst = *d;
+  dsecond = *d;
+  dthird = *d;
+
+  dfirst.perm[0] = (d->perm[0] & ~1);
+  dfirst.perm[1] = (d->perm[0] & ~1) + 1;
+  dfirst.perm[2] = (d->perm[2] & ~1);
+  dfirst.perm[3] = (d->perm[2] & ~1) + 1;
+  dsecond.perm[0] = (d->perm[1] & ~1);
+  dsecond.perm[1] = (d->perm[1] & ~1) + 1;
+  dsecond.perm[2] = (d->perm[3] & ~1);
+  dsecond.perm[3] = (d->perm[3] & ~1) + 1;
+  dthird.perm[0] = (d->perm[0] % 2);
+  dthird.perm[1] = (d->perm[1] % 2) + 4;
+  dthird.perm[2] = (d->perm[2] % 2) + 2;
+  dthird.perm[3] = (d->perm[3] % 2) + 6;
+
+  dfirst.target = gen_reg_rtx (dfirst.vmode);
+  dsecond.target = gen_reg_rtx (dsecond.vmode);
+  dthird.op0 = dfirst.target;
+  dthird.op1 = dsecond.target;
+  dthird.one_operand_p = false;
+
+  canonicalize_perm (&dfirst);
+  canonicalize_perm (&dsecond);
+
+  ok = expand_vec_perm_1 (&dfirst)
+       && expand_vec_perm_1 (&dsecond)
+       && expand_vec_perm_1 (&dthird);
+
+  gcc_assert (ok);
+
+  return true;
+}
+
 /* A subroutine of expand_vec_perm_even_odd_1.  Implement the double-word
    permutation with two pshufb insns and an ior.  We should have
    already failed all two instruction sequences.  */
@@ -37652,6 +37704,9 @@
 
   /* Try sequences of three instructions.  */
 
+  if (expand_vec_perm_2vperm2f128_vshuf (d))
+    return true;
+
   if (expand_vec_perm_pshufb2 (d))
     return true;
 
@@ -37689,12 +37744,56 @@
   return false;
 }
 
+/* If a permutation only uses one operand, make it clear.  Returns true
+   if the permutation references both operands.  */
+
+static bool
+canonicalize_perm (struct expand_vec_perm_d *d)
+{
+  int i, which, nelt = d->nelt;
+
+  for (i = which = 0; i < nelt; ++i)
+    which |= (d->perm[i] < nelt ? 1 : 2);
+
+  d->one_operand_p = true;
+  switch (which)
+    {
+    default:
+      gcc_unreachable();
+
+    case 3:
+      if (!rtx_equal_p (d->op0, d->op1))
+        {
+          d->one_operand_p = false;
+          break;
+        }
+      /* The elements of PERM do not suggest that only the first operand
+         is used, but both operands are identical.  Allow easier matching
+         of the permutation by folding the permutation into the single
+         input vector.  */
+      /* FALLTHRU */
+
+    case 2:
+      for (i = 0; i < nelt; ++i)
+        d->perm[i] &= nelt - 1;
+      d->op0 = d->op1;
+      break;
+
+    case 1:
+      d->op1 = d->op0;
+      break;
+    }
+
+  return (which == 3);
+}
+
 bool
 ix86_expand_vec_perm_const (rtx operands[4])
 {
   struct expand_vec_perm_d d;
   unsigned char perm[MAX_VECT_LEN];
-  int i, nelt, which;
+  int i, nelt;
+  bool two_args;
   rtx sel;
 
   d.target = operands[0];
@@ -37711,45 +37810,16 @@
   gcc_assert (XVECLEN (sel, 0) == nelt);
   gcc_checking_assert (sizeof (d.perm) == sizeof (perm));
 
-  for (i = which = 0; i < nelt; ++i)
+  for (i = 0; i < nelt; ++i)
     {
       rtx e = XVECEXP (sel, 0, i);
       int ei = INTVAL (e) & (2 * nelt - 1);
-
-      which |= (ei < nelt ? 1 : 2);
       d.perm[i] = ei;
       perm[i] = ei;
     }
 
-  d.one_operand_p = true;
-  switch (which)
-    {
-    default:
-      gcc_unreachable();
+  two_args = canonicalize_perm (&d);
 
-    case 3:
-      if (!rtx_equal_p (d.op0, d.op1))
-        {
-          d.one_operand_p = false;
-          break;
-        }
-      /* The elements of PERM do not suggest that only the first operand
-         is used, but both operands are identical.  Allow easier matching
-         of the permutation by folding the permutation into the single
-         input vector.  */
-      /* FALLTHRU */
-
-    case 2:
-      for (i = 0; i < nelt; ++i)
-        d.perm[i] &= nelt - 1;
-      d.op0 = d.op1;
-      break;
-
-    case 1:
-      d.op1 = d.op0;
-      break;
-    }
-
   if (ix86_expand_vec_perm_const_1 (&d))
     return true;
 
@@ -37757,7 +37827,7 @@
      same, the above tried to expand with one_operand_p and flattened selector.
      If that didn't work, retry without one_operand_p; we succeeded with that
      during testing.  */
-  if (which == 3 && d.one_operand_p)
+  if (two_args && d.one_operand_p)
     {
       d.one_operand_p = false;
       memcpy (d.perm, perm, sizeof (perm));
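
The behaviour of canonicalize_perm can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the GCC implementation: struct perm_model and canonicalize_model are hypothetical names, operands are plain integer ids standing in for rtx values, and equality of ids stands in for rtx_equal_p. It follows the same three cases as the patch. If only the first operand is referenced, op1 is made an alias of op0. If only the second operand is referenced, or if both are referenced but are identical, the selector is folded into the range 0..nelt-1. Otherwise the permutation stays a two-operand one.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NELT 8

/* Hypothetical stand-in for struct expand_vec_perm_d: operands are integer
   ids rather than rtx values.  */
struct perm_model
{
  int op0, op1;
  unsigned char perm[MAX_NELT];
  int nelt; /* assumed to be a power of two, as the mask below requires */
  bool one_operand_p;
};

/* Returns true if the selector references both operands, as in the patch.  */
bool
canonicalize_model (struct perm_model *d)
{
  int i, which = 0, nelt = d->nelt;

  assert (nelt > 0 && nelt <= MAX_NELT && (nelt & (nelt - 1)) == 0);

  /* Bit 0: some index selects from op0.  Bit 1: some index selects from op1.  */
  for (i = 0; i < nelt; ++i)
    which |= (d->perm[i] < nelt ? 1 : 2);

  d->one_operand_p = true;
  switch (which)
    {
    case 3:
      if (d->op0 != d->op1)
        {
          d->one_operand_p = false;
          break;
        }
      /* Both operands are the same value: fold into one input.  */
      /* FALLTHRU */
    case 2:
      for (i = 0; i < nelt; ++i)
        d->perm[i] &= nelt - 1;
      d->op0 = d->op1;
      break;
    case 1:
      d->op1 = d->op0;
      break;
    default:
      assert (!"unreachable");
    }

  return which == 3;
}
```

Note that the return value reports what the selector referenced before folding, so it can be true even when one_operand_p ends up true. That is the case the retry at the end of ix86_expand_vec_perm_const depends on.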