Message ID | 53077F31.8070003@arm.com |
---|---|
State | New |
Headers | show |
I've been doing some local testing using this patch as a basis for some of my own work on NEON intrinsics, and it seems good to me. A couple of points: (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian mode on NEON": firstly "wierd" should be spelt "weird"; secondly, if I understand right, this comment belongs with the next "if (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's written. (2) as you say, this code is not exercised, unless you do something to remove the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I politely suggest you do that here in this patch? (3) In my own regression testing, with const_vec_perm enabled on big_endian, I see 2*PASS->FAIL, namely gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1 gcc.dg/vect/vect-114.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 0 loops" 1 These are essentially noise, but the noise is removed and I see no other problems, if (after this patch) I re-enable the testsuite's "vect_perm" target selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you like a separate patch for that or roll it in here? Cheers, Alan Tejas Belagod wrote: > > Hi, > > > > When a shuffle of more than one input happens, on NEON we end up with a > > 'mixed-endian' format in the register list which TBL operates on. We don't make > > this correction in RTL and therefore the shuffle operation gets it incorrect. > > Here is a patch that fixes-up the index table in the selector rtx in RTL to also > > be mixed-endian to reflect what's happening on NEON. > > > > As trunk stands, this patch will not be exercised as constant vector permute for > > Big-endian is disabled. I've tested this by locally enabling const vec_perm and > > it fixes the some regressions we have on big-endian: > > > > aarch64_be-none-elf: > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer > > -funroll-all-loops -finline-functions > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer > > -funroll-loops > > FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -g > > FAIL->PASS: gcc.dg/torture/vector-shuffle1.c -O0 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v2df.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v2di.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v2si.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v4si.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c -O2 execution test > > FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c -O2 execution test > > FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test > > FAIL->PASS: gcc.dg/vect/vect-114.c execution test > > FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test > > FAIL->PASS: gcc.dg/vect/vect-15.c execution test > > > > Also regressed on aarch64-none-elf. > > > > OK for stage-1? > > > > Thanks, > > Tejas. > > > > 2014-02-21 Tejas Belagod <tejas.belagod@arm.com> > > > > gcc/ > > * config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for > > big-endian when dealing with more than one input shuffle vector. > >
Further to that - all looks good after one-liner http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01142.html has gone in. (Without that, enabling the code in Tejas' patch causes a regression in gcc.dg/torture/vshuf-v4hi.c as loading a vector constant goes wrong). I'll send a patch to enable this and fix up the testsuite shortly... Cheers, Alan Alan Lawrence wrote: > I've been doing some local testing using this patch as a basis for some of my > own work on NEON intrinsics, and it seems good to me. A couple of points: > > (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian > mode on NEON": firstly "wierd" should be spelt "weird"; > secondly, if I understand right, this comment belongs with the next "if > (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's > written. > > (2) as you say, this code is not exercised, unless you do something to remove > the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I > politely suggest you do that here in this patch? > > (3) In my own regression testing, with const_vec_perm enabled on big_endian, I > see 2*PASS->FAIL, namely > > gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1 > > gcc.dg/vect/vect-114.c -flto -ffat-lto-objects scan-tree-dump-times > vect "vectorized 0 loops" 1 > > These are essentially noise, but the noise is removed and I see no other > problems, if (after this patch) I re-enable the testsuite's "vect_perm" target > selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you > like a separate patch for that or roll it in here? > > Cheers, Alan > > Tejas Belagod wrote: >> Hi, >> >> When a shuffle of more than one input happens, on NEON we end up with a >> 'mixed-endian' format in the register list which TBL operates on. We don't make >> this correction in RTL and therefore the shuffle operation gets it incorrect. >> Here is a patch that fixes-up the index table in the selector rtx in RTL to also >> be mixed-endian to reflect what's happening on NEON. >> >> As trunk stands, this patch will not be exercised as constant vector permute for >> Big-endian is disabled. I've tested this by locally enabling const vec_perm and >> it fixes the some regressions we have on big-endian: >> >> aarch64_be-none-elf: >> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer >> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer >> -funroll-all-loops -finline-functions >> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer >> -funroll-loops >> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -g >> FAIL->PASS: gcc.dg/torture/vector-shuffle1.c -O0 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v2df.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v2di.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v2si.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v4si.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c -O2 execution test >> FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c -O2 execution test >> FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test >> FAIL->PASS: gcc.dg/vect/vect-114.c execution test >> FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test >> FAIL->PASS: gcc.dg/vect/vect-15.c execution test >> >> Also regressed on aarch64-none-elf. >> >> OK for stage-1? >> >> Thanks, >> Tejas. >> >> 2014-02-21 Tejas Belagod <tejas.belagod@arm.com> >> >> gcc/ >> * config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for >> big-endian when dealing with more than one input shuffle vector. >> > >
On 02/21/2014 08:30 AM, Tejas Belagod wrote: > + /* If two vectors, we end up with a wierd mixed-endian mode on NEON. */ > + if (BYTES_BIG_ENDIAN) > + { > + if (!d->one_vector_p && d->perm[i] & nunits) > + { > + /* Extract the offset. */ > + elt = d->perm[i] & (nunits - 1); > + /* Reverse the top half. */ > + elt = nunits - 1 - elt; > + /* Offset it by the bottom half. */ > + elt += nunits; > + } > + else > + elt = nunits - 1 - d->perm[i]; > + } Isn't this just elt = d->perm[i] ^ (nunits - 1); all the time? I.e. invert the index within the word, but leave the word index (nunits) unchanged. r~
Alan Lawrence wrote: > Further to that - all looks good after one-liner > http://gcc.gnu.org/ml/gcc-patches/2014-03/msg01142.html has gone in. (Without > that, enabling the code in Tejas' patch causes a regression in > gcc.dg/torture/vshuf-v4hi.c as loading a vector constant goes wrong). > Thanks for your fix. Tejas. > I'll send a patch to enable this and fix up the testsuite shortly... > > Cheers, Alan > > Alan Lawrence wrote: >> I've been doing some local testing using this patch as a basis for some of my >> own work on NEON intrinsics, and it seems good to me. A couple of points: >> >> (1) Re. the comment that "If two vectors, we end up with a wierd mixed-endian >> mode on NEON": firstly "wierd" should be spelt "weird"; >> secondly, if I understand right, this comment belongs with the next "if >> (!d->one_vector_p...)" rather than the "if (BYTES_BIG_ENDIAN)" before which it's >> written. >> >> (2) as you say, this code is not exercised, unless you do something to remove >> the 'if (BYTES_BIG_ENDIAN) return false;' earlier in that same function. Can I >> politely suggest you do that here in this patch? >> >> (3) In my own regression testing, with const_vec_perm enabled on big_endian, I >> see 2*PASS->FAIL, namely >> >> gcc.dg/vect/vect-114.c scan-tree-dump-times vect "vectorized 0 loops" 1 >> >> gcc.dg/vect/vect-114.c -flto -ffat-lto-objects scan-tree-dump-times >> vect "vectorized 0 loops" 1 >> >> These are essentially noise, but the noise is removed and I see no other >> problems, if (after this patch) I re-enable the testsuite's "vect_perm" target >> selector for aarch64 big-endian (testsuite/lib/target-supports.exp). Would you >> like a separate patch for that or roll it in here? >> >> Cheers, Alan >> >> Tejas Belagod wrote: >>> Hi, >>> >>> When a shuffle of more than one input happens, on NEON we end up with a >>> 'mixed-endian' format in the register list which TBL operates on. We don't make >>> this correction in RTL and therefore the shuffle operation gets it incorrect. >>> Here is a patch that fixes-up the index table in the selector rtx in RTL to also >>> be mixed-endian to reflect what's happening on NEON. >>> >>> As trunk stands, this patch will not be exercised as constant vector permute for >>> Big-endian is disabled. I've tested this by locally enabling const vec_perm and >>> it fixes the some regressions we have on big-endian: >>> >>> aarch64_be-none-elf: >>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer >>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer >>> -funroll-all-loops -finline-functions >>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -fomit-frame-pointer >>> -funroll-loops >>> FAIL->PASS: gcc.c-torture/execute/loop-11.c execution, -O3 -g >>> FAIL->PASS: gcc.dg/torture/vector-shuffle1.c -O0 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v16qi.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v2df.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v2di.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v2sf.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v2si.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v4sf.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v4si.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v8hi.c -O2 execution test >>> FAIL->PASS: gcc.dg/torture/vshuf-v8qi.c -O2 execution test >>> FAIL->PASS: gcc.dg/vect/vect-114.c -flto -ffat-lto-objects execution test >>> FAIL->PASS: gcc.dg/vect/vect-114.c execution test >>> FAIL->PASS: gcc.dg/vect/vect-15.c -flto -ffat-lto-objects execution test >>> FAIL->PASS: gcc.dg/vect/vect-15.c execution test >>> >>> Also regressed on aarch64-none-elf. >>> >>> OK for stage-1? >>> >>> Thanks, >>> Tejas. >>> >>> 2014-02-21 Tejas Belagod <tejas.belagod@arm.com> >>> >>> gcc/ >>> * config/aarch64/aarch64.c (aarch64_evpc_tbl): Fix index vector for >>> big-endian when dealing with more than one input shuffle vector. >>> >> >
Richard Henderson wrote: > On 02/21/2014 08:30 AM, Tejas Belagod wrote: >> + /* If two vectors, we end up with a wierd mixed-endian mode on NEON. */ >> + if (BYTES_BIG_ENDIAN) >> + { >> + if (!d->one_vector_p && d->perm[i] & nunits) >> + { >> + /* Extract the offset. */ >> + elt = d->perm[i] & (nunits - 1); >> + /* Reverse the top half. */ >> + elt = nunits - 1 - elt; >> + /* Offset it by the bottom half. */ >> + elt += nunits; >> + } >> + else >> + elt = nunits - 1 - d->perm[i]; >> + } > > Isn't this just > > elt = d->perm[i] ^ (nunits - 1); > > all the time? I.e. invert the index within the word, > but leave the word index (nunits) unchanged. > Yes, I think that works. Thanks! Tejas.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index ea90311..fd473a3 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -8128,7 +8128,28 @@ aarch64_evpc_tbl (struct expand_vec_perm_d *d) return false; for (i = 0; i < nelt; ++i) - rperm[i] = GEN_INT (d->perm[i]); + { + int nunits = GET_MODE_NUNITS (vmode); + int elt = d->perm[i]; + + /* If two vectors, we end up with a wierd mixed-endian mode on NEON. */ + if (BYTES_BIG_ENDIAN) + { + if (!d->one_vector_p && d->perm[i] & nunits) + { + /* Extract the offset. */ + elt = d->perm[i] & (nunits - 1); + /* Reverse the top half. */ + elt = nunits - 1 - elt; + /* Offset it by the bottom half. */ + elt += nunits; + } + else + elt = nunits - 1 - d->perm[i]; + } + + rperm[i] = GEN_INT (elt); + } sel = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (nelt, rperm)); sel = force_reg (vmode, sel);