Message ID: 20130227172947.31fa279c@octopus
State: New
On 02/27/2013 09:29 AM, Julian Brown wrote:
> Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
> +++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
> @@ -79,6 +79,6 @@ int main ()
>    return 0;
> }
>
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
> /* { dg-final { cleanup-tree-dump "vect" } } */

If this and other modified checks only fail for ARM big-endian then
they should check for that so they don't XPASS for other targets.
It's also possible now to do things like { target vect_blah xfail
arm_big_endian }, which might be useful for some tests.

Janis
On Wed, 27 Feb 2013 11:04:04 -0800
Janis Johnson <janis_johnson@mentor.com> wrote:

> On 02/27/2013 09:29 AM, Julian Brown wrote:
> > Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
> > ===================================================================
> > --- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
> > +++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
> > @@ -79,6 +79,6 @@ int main ()
> >    return 0;
> > }
> >
> > -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
> > /* { dg-final { cleanup-tree-dump "vect" } } */
>
> If this and other modified checks only fail for ARM big-endian then
> they should check for that so they don't XPASS for other targets.
> It's also possible now to do things like { target vect_blah xfail
> arm_big_endian }, which might be useful for some tests.

I don't think I understand -- my expectation was e.g. that that test
would fail for any target which doesn't support vect_unpack. Surely
you'd only get an XPASS if the test passed when vect_unpack was not
true?

I'm not sure why checking for a particular architecture-specific
predicate would be preferable to checking that a general feature is
supported. As time progresses, it might well be that e.g. vect_unpack
becomes supported for big-endian ARM, at which point we shouldn't need
to edit all the individual tests again...

Thanks,

Julian
On 02/28/2013 02:06 AM, Julian Brown wrote:
> On Wed, 27 Feb 2013 11:04:04 -0800
> Janis Johnson <janis_johnson@mentor.com> wrote:
>
>> On 02/27/2013 09:29 AM, Julian Brown wrote:
>>> Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
>>> ===================================================================
>>> --- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
>>> +++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
>>> @@ -79,6 +79,6 @@ int main ()
>>>    return 0;
>>> }
>>>
>>> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
>>> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
>>> /* { dg-final { cleanup-tree-dump "vect" } } */
>>
>> If this and other modified checks only fail for ARM big-endian then
>> they should check for that so they don't XPASS for other targets.
>> It's also possible now to do things like { target vect_blah xfail
>> arm_big_endian }, which might be useful for some tests.
>
> I don't think I understand -- my expectation was e.g. that that test
> would fail for any target which doesn't support vect_unpack. Surely
> you'd only get an XPASS if the test passed when vect_unpack was not
> true?

Right.  Please ignore my mail, I was confused.

> I'm not sure why checking for a particular architecture-specific
> predicate would be preferable to checking that a general feature is
> supported. As time progresses, it might well be that e.g. vect_unpack
> becomes supported for big-endian ARM, at which point we shouldn't need
> to edit all the individual tests again...

Right.  Once again, I was confused, ignore me.

Janis
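[For readers unfamiliar with DejaGnu selectors, the two forms discussed above might appear in a reduced test like the sketch below. This is illustrative only: `vect_blah`/`arm_big_endian` in Janis's mail are placeholder effective-target names, and the loop and scan patterns here are made up rather than taken from a real testsuite file.]

```c
/* { dg-do compile } */
/* { dg-options "-O3 -fdump-tree-vect-details" } */

int a[64], b[64];

void
foo (void)
{
  for (int i = 0; i < 64; i++)
    a[i] = b[i] + 1;
}

/* Form 1: run everywhere, but expect the scan to fail on any target
   lacking the vect_unpack capability (this is the form the patch uses).  */
/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */

/* Form 2: restrict the scan to targets with some capability, and XFAIL
   it on one known-bad target (selector names are hypothetical).  */
/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { target vect_int xfail arm_big_endian } } } */
```

The practical difference Janis alludes to: form 1 XPASSes nowhere as long as the scan really does track the capability, while form 2 pins the expectation to a specific target and must be revisited if that target later gains support.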
On Wed, Feb 27, 2013 at 6:29 PM, Julian Brown <julian@codesourcery.com> wrote:
> Hi,
>
> Several new (ish?) autovectorizer features have apparently caused NEON
> support for same to regress quite heavily in big-endian mode. This
> patch is an attempt to fix things up, but is not without problems --
> maybe someone will have a suggestion as to how we should proceed.
>
> The problem (as ever) is that the ARM backend must lie to the
> middle-end about the layout of NEON vectors in big-endian mode (due to
> ABI requirements, VFP compatibility, and the middle-end semantics of
> vector indices being equivalent to those of an array with the same type
> of elements when stored in memory).

Why not simply give up?  Thus, make autovectorization unsupported for
ARM big-endian targets?

Do I understand correctly that the "only" issue is memory vs. register
element ordering?  Thus a fixup could be as simple as extra shuffles
inserted after vector memory loads and before vector memory stores
(with the hope of RTL optimizers optimizing those)?

Any "lies" are of course bad and you'll pay for them later.

Richard.

> A few years ago when the vectorizer was relatively less sophisticated,
> the ordering of vector elements could be ignored to some extent by
> disabling certain instruction patterns used by the vectorizer in
> big-endian mode which were sensitive to the ordering of elements: in
> fact this is still the strategy we're using, but it is clearly becoming
> less and less tenable as time progresses. Quad-word registers (being
> composed of two double-word registers, loaded/stored the "wrong way
> round" in big-endian mode) arguably cause more problems than
> double-word registers.
>
> So, the idea behind the attached patch was supposed to be to limit the
> autovectorizer to using double-word registers only, and to disable a
> few additional (or newly-used by the vectorizer) patterns in big-endian
> mode. That, plus several testsuite tweaks, gets us down to zero
> failures for vect.exp, which is good.
>
> The problem is that at the same time quite a large set of neon.exp
> tests regress (vzip/vuzp/vtrn): one of the new patterns which is
> disabled because it causes trouble (i.e. execution failures) for the
> vectorizer is vec_perm_const<mode>. However __builtin_shuffle (which
> uses that pattern) is used for arm_neon.h now -- so disabling it means
> that the proper instructions aren't generated for intrinsics any more
> in big-endian mode.
>
> I think we have a problem here. The vectorizer also tries to use
> __builtin_shuffle (for scatter/gather operations, when lane
> loading/storing ops aren't available), but does not understand the
> "special tweaks" that arm_evpc_neon_{vuzp,vzip,vtrn} does to try to
> hide the true element ordering of vectors from the middle-end. So, I'm
> left wondering:
>
> * Given our funky element ordering in BE mode, are the
>   __builtin_shuffle lists in arm_neon.h actually an accurate
>   representation of what the given intrinsic should do? (The fallback
>   code might or might not do the same thing, I'm not sure.)
>
> * The vectorizer tries to use VEC_PERM_EXPR (equivalent to
>   __builtin_shuffle) with e.g. pairs of doubleword registers loaded
>   from adjacent memory locations. Are the semantics required for this
>   (again, with our funky element ordering) even the same as those
>   required for the intrinsics? Including quad-word registers for the
>   latter? (My suspicion is "no", in which case there's a fundamental
>   incompatibility here that needs to be resolved somehow.)
>
> Anyway: the tl;dr is "fixing NEON vect tests breaks intrinsics". Any
> ideas for what to do about that? (FAOD, I don't think I'm in a
> position to do the kind of middle-end surgery required to fix the
> problem "properly" at this point :-p).
>
> (It's arguably more important for the vectorizer to not generate bad
> code than it is for intrinsics to work properly, in which case: OK to
> apply? Tested cross to ARM EABI with configury modifications to build
> LE/BE multilibs.)
>
> Thanks,
>
> Julian
>
> ChangeLog
>
> gcc/
>     * config/arm/arm.c (arm_array_mode_supported_p): No array modes for
>     big-endian NEON.
>     (arm_preferred_simd_mode): Always prefer 64-bit modes for
>     big-endian NEON.
>     (arm_autovectorize_vector_sizes): Use 8-byte vectors only for NEON.
>     (arm_vectorize_vec_perm_const_ok): No permutations are OK in
>     big-endian mode.
>     * config/arm/neon.md (vec_load_lanes<mode><mode>): Disable in
>     big-endian mode.
>     (vec_store_lanes<mode><mode>, vec_load_lanesti<mode>)
>     (vec_load_lanesoi<mode>, vec_store_lanesti<mode>)
>     (vec_store_lanesoi<mode>, vec_load_lanesei<mode>)
>     (vec_load_lanesci<mode>, vec_store_lanesei<mode>)
>     (vec_store_lanesci<mode>, vec_load_lanesxi<mode>)
>     (vec_store_lanesxi<mode>): Likewise.
>     (vec_widen_<US>shiftl_lo_<mode>, vec_widen_<US>shiftl_hi_<mode>)
>     (vec_widen_<US>mult_hi_<mode>, vec_widen_<US>mult_lo_<mode>):
>     Likewise.
>
> gcc/testsuite/
>     * gcc.dg/vect/slp-cond-3.c: XFAIL for !vect_unpack.
>     * gcc.dg/vect/slp-cond-4.c: Likewise.
>     * gcc.dg/vect/vect-1.c: Likewise.
>     * gcc.dg/vect/vect-1-big-array.c: Likewise.
>     * gcc.dg/vect/vect-35.c: Likewise.
>     * gcc.dg/vect/vect-35-big-array.c: Likewise.
>     * gcc.dg/vect/bb-slp-11.c: Likewise.
>     * gcc.dg/vect/bb-slp-26.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-3-big-array.c: XFAIL for
>     !vect_element_align.
>     * gcc.dg/vect/vect-over-widen-1.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-1-big-array.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-2.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-2-big-array.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-3.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-4.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
>     * gcc.dg/vect/pr43430-2.c: Likewise.
>     * gcc.dg/vect/vect-widen-shift-u16.c: XFAIL for !vect_widen_shift
>     && !vect_unpack.
>     * gcc.dg/vect/vect-widen-shift-s8.c: Likewise.
>     * gcc.dg/vect/vect-widen-shift-u8.c: Likewise.
>     * gcc.dg/vect/vect-widen-shift-s16.c: Likewise.
>     * gcc.dg/vect/vect-93.c: Only run if !vect_intfloat_cvt.
>     * gcc.dg/vect/vect-intfloat-conversion-4a.c: Only run if
>     vect_unpack.
>     * gcc.dg/vect/vect-intfloat-conversion-4b.c: Likewise.
>     * lib/target-supports.exp (check_effective_target_vect_perm): Only
>     enable for NEON little-endian.
>     (check_effective_target_vect_widen_sum_qi_to_hi): Likewise.
>     (check_effective_target_vect_widen_mult_qi_to_hi): Likewise.
>     (check_effective_target_vect_widen_mult_hi_to_si): Likewise.
>     (check_effective_target_vect_widen_shift): Likewise.
>     (check_effective_target_vect_extract_even_odd): Likewise.
>     (check_effective_target_vect_interleave): Likewise.
>     (check_effective_target_vect_stridedN): Likewise.
>     (check_effective_target_vect_multiple_sizes): Likewise.
>     (check_effective_target_vect64): Enable for any NEON.
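[As background for the "middle-end semantics" Julian mentions: GCC's generic vectors define element [i] of a vector to be the same as element [i] of an equivalent array whenever the vector is stored in memory, regardless of endianness. This is exactly the invariant that big-endian NEON vldm/vstm ordering violates. A minimal, portable sketch of the invariant, using the GCC `vector_size` extension; the function name is ours, purely for illustration:]

```c
/* GCC generic vector type: 8 bytes holding 8 signed char elements.  */
typedef signed char v8qi __attribute__ ((vector_size (8)));

/* The middle end assumes vector element [i] lines up with array
   element [i] of the same element type when the vector sits in
   memory, on both little- and big-endian hosts.  Returns 1 if the
   invariant holds, 0 otherwise.  */
int
element_order_matches_memory (void)
{
  union { signed char arr[8]; v8qi vec; } u;
  for (int i = 0; i < 8; i++)
    u.arr[i] = (signed char) i;        /* write via the array view...  */
  for (int i = 0; i < 8; i++)
    if (u.vec[i] != i)                 /* ...read via the vector view  */
      return 0;
  return 1;
}
```

On any target whose backend tells the middle end the truth, this returns 1; the "lie" in the big-endian NEON port exists to preserve this appearance while the registers hold elements in a different order.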
On Fri, 1 Mar 2013 11:07:17 +0100
Richard Biener <richard.guenther@gmail.com> wrote:

> On Wed, Feb 27, 2013 at 6:29 PM, Julian Brown
> <julian@codesourcery.com> wrote:
> > Hi,
> >
> > Several new (ish?) autovectorizer features have apparently caused
> > NEON support for same to regress quite heavily in big-endian mode.
> > This patch is an attempt to fix things up, but is not without
> > problems -- maybe someone will have a suggestion as to how we
> > should proceed.
> >
> > The problem (as ever) is that the ARM backend must lie to the
> > middle-end about the layout of NEON vectors in big-endian mode (due
> > to ABI requirements, VFP compatibility, and the middle-end
> > semantics of vector indices being equivalent to those of an array
> > with the same type of elements when stored in memory).
>
> Why not simply give up?  Thus, make autovectorization unsupported for
> ARM big-endian targets?

That's certainly a tempting option...

> Do I understand correctly that the "only" issue is memory vs. register
> element ordering?  Thus a fixup could be as simple as extra shuffles
> inserted after vector memory loads and before vector memory stores
> (with the hope of RTL optimizers optimizing those)?

It's not even necessary to use explicit shuffles -- NEON has perfectly
good instructions for loading/storing vectors in the "right" order, in
the form of vld1 & vst1. I'm afraid the solution to this problem might
have been staring us in the face for years, which is simply to forbid
vldr/vstr/vldm/vstm (the instructions which lead to weird element
permutations in BE mode) for loading/storing NEON vectors altogether.
That way the vectorizer gets what it wants, the intrinsics can continue
to use __builtin_shuffle exactly as they are doing, and we get to
remove all the bits which fiddle vector element numbering in BE mode in
the ARM backend.

I can't exactly remember why we didn't do that to start with. I think
the problem was ABI-related, or to do with transferring NEON vectors
to/from ARM registers when it was necessary to do that... I'm planning
to do some archaeology to try to see if I can figure out a definitive
answer.

(Previous discussions include, e.g.:

http://gcc.gnu.org/ml/gcc-patches/2009-11/msg00876.html
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
http://lists.linaro.org/pipermail/linaro-toolchain/2010-November/000437.html

it looks like ABI boundaries require vldr/vstr/vldm/vstm ordering:
maybe those can be treated as "opaque" transfers and continue to use
the same instructions & ordering, but vld1/vst1 can be used everywhere
else?)

> Any "lies" are of course bad and you'll pay for them later.

Indeed :-).

Cheers,

Julian
> > Do I understand correctly that the "only" issue is memory vs.
> > register element ordering?  Thus a fixup could be as simple as extra
> > shuffles inserted after vector memory loads and before vector memory
> > stores (with the hope of RTL optimizers optimizing those)?
>
> It's not even necessary to use explicit shuffles -- NEON has perfectly
> good instructions for loading/storing vectors in the "right" order, in
> the form of vld1 & vst1. I'm afraid the solution to this problem might
> have been staring us in the face for years, which is simply to forbid
> vldr/vstr/vldm/vstm (the instructions which lead to weird element
> permutations in BE mode) for loading/storing NEON vectors altogether.
> That way the vectorizer gets what it wants, the intrinsics can
> continue to use __builtin_shuffle exactly as they are doing, and we
> get to remove all the bits which fiddle vector element numbering in BE
> mode in the ARM backend.
>
> I can't exactly remember why we didn't do that to start with. I think
> the problem was ABI-related, or to do with transferring NEON vectors
> to/from ARM registers when it was necessary to do that... I'm planning
> to do some archaeology to try to see if I can figure out a definitive
> answer.

The ABI defined vector types (uint32x4_t etc) are defined to be in
vldm/vstm order.

Paul
On Fri, 1 Mar 2013 14:35:05 +0000
Paul Brook <paul@codesourcery.com> wrote:

> > It's not even necessary to use explicit shuffles -- NEON has
> > perfectly good instructions for loading/storing vectors in the
> > "right" order, in the form of vld1 & vst1. I'm afraid the solution
> > to this problem might have been staring us in the face for years,
> > which is simply to forbid vldr/vstr/vldm/vstm (the instructions
> > which lead to weird element permutations in BE mode) for
> > loading/storing NEON vectors altogether. That way the vectorizer
> > gets what it wants, the intrinsics can continue to use
> > __builtin_shuffle exactly as they are doing, and we get to remove
> > all the bits which fiddle vector element numbering in BE mode in
> > the ARM backend.
> >
> > I can't exactly remember why we didn't do that to start with. I
> > think the problem was ABI-related, or to do with transferring NEON
> > vectors to/from ARM registers when it was necessary to do that...
> > I'm planning to do some archaeology to try to see if I can figure
> > out a definitive answer.
>
> The ABI defined vector types (uint32x4_t etc) are defined to be in
> vldm/vstm order.

There's no conflict with the ABI-defined vector order -- the ABI
(looking at AAPCS, IHI 0042D) describes "containerized" vectors which
should be used to pass and return vector quantities at ABI boundaries,
but I couldn't find any further restrictions. Internally to a function,
we are still free to use vld1/vst1 vector ordering. Using
"containerized"/opaque transfers, the bit pattern of a vector in one
function (using vld1/vst1 ordering internally) will of course remain
unchanged if passed to another function and using the same ordering
there also.

Actually making that work (especially efficiently) with GCC is a
slightly different matter. Let's call vldm/vstm-ordered vectors
"containerized" format, and vld1/vst1-ordered vectors "array" format.
We need to introduce the concept of marshalling vector arguments from
array format to containerized format when passing them to a function,
and unmarshalling those vector arguments back the other way on function
entry. AFAICT, GCC does not have suitable infrastructure for
implementing such functionality at present: consider that e.g. vectors
passed by value on the stack should use containerized format, which
means the called function cannot simply dereference the stack pointer
to read the vector:

void foo (int dummy1, int dummy2, int dummy3, int dummy4, v4si myvec)
{
  v4si *myvec_ptr = &myvec;
  ...
}

Here the hypothetical "unmarshal" operation would need to do something
like:

  add r0, sp, #myvec_offset
  vldm r0, {q0}
  add r0, sp, #myvec_temp_offset
  vst1.32 {q0}, [r0]
  /* myvec_ptr points to myvec_temp_offset.  */

In many cases the marshall/unmarshall operations don't have to do
anything except use vldr/vstr/vldm/vstm or the core-register transfer
equivalents instead of vld1/vst1 for reading/writing vectors used as
arguments, so we generally don't have to incur any overhead like that,
though.

I experimented with a patch which tried to do marshalling/unmarshalling
in RTL, using DImode/TImode for the containerized format (splitting
neon.md/*neon_mov<mode> into DImode/TImode versions for containerized
vectors, and V*mode versions for array-format vectors with only
vmov/vld1/vst1 alternatives, and tweaking several other target macros
etc. appropriately), but that didn't work very well, and wouldn't be
able to handle the case which requires a copy described above, I don't
think. (Several optimisation passes are keen to form V*mode subregs of
DImode values, even if CANNOT_CHANGE_MODE_CLASS/MODES_TIEABLE_P are
tweaked. The hooks/macros controlling argument & function-return
promotion appear to get some of the way there to implementing the RTL
"solution", but evidently not far enough.)

So, I think the proper way of implementing this is probably at the tree
level -- maybe rewriting vector types in function argument lists to
"opaque" vectors, like e.g. rs6000 uses for some intrinsics, and
inserting machine-dependent operations for marshalling and
unmarshalling at appropriate points -- maybe still using DImode/TImode
to represent containerized (opaque) vectors at the RTL level, or maybe
introducing new machine modes if that doesn't work reliably.

The two main advantages of this approach over the status quo are:

1. Big-endian mode works as well as little-endian mode for NEON --
   intrinsics, vectorization, the lot.

2. Even in little-endian mode, using vld1/vst1 predominantly over
   vldr/vstr means that the alignment hints in those instructions can
   be used more often, which might be a minor performance boost.

Would this be a sensible approach, or am I completely wrong? I'm not
sure if I can dedicate time to implementing it at the moment in any
case. Maybe someone within ARM (or Linaro) could take it up? ;-)

(Anyway, I still think it might be a good idea to apply the original
patch until such work is done, considering vectorization -- enabled at
-O3 -- is broken with NEON turned on in big-endian mode at the moment.)

Thanks,

Julian
> > > I can't exactly remember why we didn't do that to start with. I
> > > think the problem was ABI-related, or to do with transferring NEON
> > > vectors to/from ARM registers when it was necessary to do that...
> > > I'm planning to do some archaeology to try to see if I can figure
> > > out a definitive answer.
> >
> > The ABI defined vector types (uint32x4_t etc) are defined to be in
> > vldm/vstm order.
>
> There's no conflict with the ABI-defined vector order -- the ABI
> (looking at AAPCS, IHI 0042D) describes "containerized" vectors which
> should be used to pass and return vector quantities at ABI boundaries,
> but I couldn't find any further restrictions. Internally to a
> function, we are still free to use vld1/vst1 vector ordering. Using
> "containerized"/opaque transfers, the bit pattern of a vector in one
> function (using vld1/vst1 ordering internally) will of course remain
> unchanged if passed to another function and using the same ordering
> there also.

Ah, ok. If you make the ABI defined types distinct from the GCC generic
vector types (as used by the vectorizer), then in principle that should
work. I agree that current GCC probably does not have the
infrastructure to do that, and some of the vector code plays a bit fast
and loose with type conversions/subregs.

Remember that it's not just function arguments, it's any interface
shared between functions, i.e. including structures and global
variables.

> Actually making that work (especially efficiently) with GCC is a
> slightly different matter. Let's call vldm/vstm-ordered vectors
> "containerized" format, and vld1/vst1-ordered vectors "array" format.
> We need to introduce the concept of marshalling vector arguments from
> array format to containerized format when passing them to a function,
> and unmarshalling those vector arguments back the other way on
> function entry. AFAICT, GCC does not have suitable infrastructure for
> implementing such functionality at present: consider that e.g.
> vectors passed by value on the stack should use containerized format,
> which means the called function cannot simply dereference the stack
> pointer to read the vector:

IIRC I/we tried to do something very similar (possibly the other way
around) by abusing the unaligned load mechanism. I don't remember why
that failed.

Paul
On Mon, 4 Mar 2013 13:08:57 +0000
Paul Brook <paul@codesourcery.com> wrote:

> > > > I can't exactly remember why we didn't do that to start with. I
> > > > think the problem was ABI-related, or to do with transferring
> > > > NEON vectors to/from ARM registers when it was necessary to do
> > > > that... I'm planning to do some archaeology to try to see if I
> > > > can figure out a definitive answer.
> > >
> > > The ABI defined vector types (uint32x4_t etc) are defined to be
> > > in vldm/vstm order.
> >
> > There's no conflict with the ABI-defined vector order -- the ABI
> > (looking at AAPCS, IHI 0042D) describes "containerized" vectors
> > which should be used to pass and return vector quantities at ABI
> > boundaries, but I couldn't find any further restrictions.
> > Internally to a function, we are still free to use vld1/vst1 vector
> > ordering. Using "containerized"/opaque transfers, the bit pattern
> > of a vector in one function (using vld1/vst1 ordering internally)
> > will of course remain unchanged if passed to another function and
> > using the same ordering there also.
>
> Ah, ok. If you make the ABI defined types distinct from the GCC
> generic vector types (as used by the vectorizer), then in principle
> that should work. I agree that current GCC probably does not have the
> infrastructure to do that, and some of the vector code plays a bit
> fast and loose with type conversions/subregs.

(Subregs use memory ordering for the byte offset, so I think those are
OK if we use array-order loads/stores pervasively. I'm not 100% sure
though...)

> Remember that it's not just function arguments, it's any interface
> shared between functions, i.e. including structures and global
> variables.

Ugh, I hadn't considered structures or global variables :-/. If we
decide they have to use the containerized format also, then we lose a
lot of the supposed advantage of using array-format vectors
"everywhere" (apart from at procedure call boundaries); for instance,
if we want code with a global variable like:

union {
  char myarr[8];
  v8qi myvec;
} foo;

to do the right thing (i.e., with elements of myvec corresponding
one-to-one to elements of myarr), then using the containerized format
for accesses to myvec would be a non-starter.

Skimming the AAPCS, I'm not sure it actually specifies anything about
the layout of global variables which may be shared between functions
(it'd make sense to do so -- maybe it's elsewhere in the EABI
documents). Aggregates passed by value could also be
marshalled/unmarshalled like vectors, though that starts to sound much
less tractable than dealing with vectors alone.

> > Actually making that work (especially efficiently) with GCC is a
> > slightly different matter. Let's call vldm/vstm-ordered vectors
> > "containerized" format, and vld1/vst1-ordered vectors "array"
> > format. We need to introduce the concept of marshalling vector
> > arguments from array format to containerized format when passing
> > them to a function, and unmarshalling those vector arguments back
> > the other way on function entry. AFAICT, GCC does not have suitable
> > infrastructure for implementing such functionality at present:
> > consider that e.g. vectors passed by value on the stack should use
> > containerized format, which means the called function cannot simply
> > dereference the stack pointer to read the vector:
>
> IIRC I/we tried to do something very similar (possibly the other way
> around) by abusing the unaligned load mechanism. I don't remember
> why that failed.

That'd be this conversation:

http://gcc.gnu.org/ml/gcc-patches/2009-11/msg00876.html

we only tweaked the vectorizer to always use movmisalign, leaving
intrinsics & generic vectors using vldm/vstm order. Fixing up the
resulting chaos using ad-hoc hacks didn't go down too well with
maintainers, so the patch fizzled out.

Cheers,

Julian
On Mon, 4 Mar 2013 15:29:22 +0000
Julian Brown <julian@codesourcery.com> wrote:

> > Remember that it's not just function arguments, it's any interface
> > shared between functions, i.e. including structures and global
> > variables.
>
> Ugh, I hadn't considered structures or global variables :-/. If we
> decide they have to use the containerized format also, then we lose a
> lot of the supposed advantage of using array-format vectors
> "everywhere" (apart from at procedure call boundaries), for instance
> if we want code with a global variable like:
> [...]
> Skimming the AAPCS, I'm not sure it actually specifies anything about
> the layout of global variables which may be shared between functions
> (it'd make sense to do so -- maybe it's elsewhere in the EABI
> documents). Aggregates passed by value could also be
> marshalled/unmarshalled like vectors, though that starts to sound
> much less tractable than dealing with vectors alone.

I somehow missed the "Appendix A: Support for Advanced SIMD Extensions"
in the AAPCS document (it's not in the TOC!). It looks like the builtin
vector types are indeed defined to be stored in memory in vldm/vstm
order -- I think that means we're back to square one.

So: thoughts on disabling vectorization altogether in big-endian mode?

Julian
> I somehow missed the "Appendix A: Support for Advanced SIMD
> Extensions" in the AAPCS document (it's not in the TOC!). It looks
> like the builtin vector types are indeed defined to be stored in
> memory in vldm/vstm order -- I think that means we're back to square
> one.

There's still the possibility of making gcc "generic" vector types
different from the ABI specified types[1], but that feels like it's
probably a really bad idea.

Having a distinct set of types just for the vectorizer may be a more
viable option. IIRC the type selection hooks are more flexible than
when we first looked at this problem.

Paul

[1] e.g. int gcc __attribute__((vector_size(8))); v.s. int32x2_t eabi;
On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com> wrote:
>> I somehow missed the "Appendix A: Support for Advanced SIMD
>> Extensions" in the AAPCS document (it's not in the TOC!). It looks
>> like the builtin vector types are indeed defined to be stored in
>> memory in vldm/vstm order -- I think that means we're back to square
>> one.
>
> There's still the possibility of making gcc "generic" vector types
> different from the ABI specified types[1], but that feels like it's
> probably a really bad idea.
>
> Having a distinct set of types just for the vectorizer may be a more
> viable option. IIRC the type selection hooks are more flexible than
> when we first looked at this problem.
>
> Paul
>
> [1] e.g. int gcc __attribute__((vector_size(8))); v.s. int32x2_t eabi;

I think int32x2_t should not be a GCC vector type (thus not have a
vector mode). The ABI specified types should map to an integer mode of
the right size instead. The vectorizer would then still use internal
GCC vector types and modes and the backend needs to provide instruction
patterns that do the right thing with the element ordering the
vectorizer expects.

How are the int32x2_t types used? I suppose they are arguments to the
intrinsics. Which means that for _most_ operations element order does
not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply use
the equivalent of return (int32x2_t)((gcc_int32x2_t)x +
(gcc_int32x2_t)y). In intrinsics where order matters you'd insert
appropriate __builtin_shuffle()s.

Oh, of course do the above only for big-endian mode ...

The other way around, mapping intrinsics and ABI vectors to vector
modes, will have issues ... you'd have to guard all optab queries in
the middle-end to fail for arm big-endian as they expect instruction
patterns that deal with the GCC vector ordering.

Thus: model the backend after GCC's expectations and "fixup" the rest
by fixing the ABI types and intrinsics.

Richard.
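[Richard's plus32x2 idea can be sketched concretely with GCC generic vectors. This is purely an illustration of the wrapper shape, not real arm_neon.h code: `abi_int32x2_t` stands in for an ABI type lowered to an integer mode, and memcpy models the cost-free reinterpretation between the opaque container and the GCC vector type (on a real target this would be a no-op or a single register move):]

```c
#include <stdint.h>
#include <string.h>

/* The ABI-visible type stays opaque: just a 64-bit integer container,
   so the middle end never reasons about its element order.  */
typedef uint64_t abi_int32x2_t;

/* The GCC generic vector type the wrapper uses internally.  */
typedef int32_t gcc_int32x2_t __attribute__ ((vector_size (8)));

/* Element order is irrelevant for lane-wise addition, so the wrapper
   can simply reinterpret, add, and reinterpret back.  */
static abi_int32x2_t
plus32x2 (abi_int32x2_t x, abi_int32x2_t y)
{
  gcc_int32x2_t a, b, r;
  memcpy (&a, &x, sizeof a);   /* view container as a vector */
  memcpy (&b, &y, sizeof b);
  r = a + b;                   /* lane-wise add on the GCC vector type */
  memcpy (&x, &r, sizeof x);   /* back to the opaque container */
  return x;
}
```

For order-sensitive intrinsics, the same wrapper shape would insert a __builtin_shuffle between the reinterpretations, as Richard describes.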
On Tue, 5 Mar 2013 10:42:59 +0100
Richard Biener <richard.guenther@gmail.com> wrote:

> On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com>
> wrote:
> >> I somehow missed the "Appendix A: Support for Advanced SIMD
> >> Extensions" in the AAPCS document (it's not in the TOC!).  It looks
> >> like the builtin vector types are indeed defined to be stored in
> >> memory in vldm/vstm order -- I think that means we're back to
> >> square one.
> >
> > There's still the possibility of making gcc "generic" vector types
> > different from the ABI specified types[1], but that feels like it's
> > probably a really bad idea.
> >
> > Having a distinct set of types just for the vectorizer may be a
> > more viable option.  IIRC the type selection hooks are more flexible
> > than when we first looked at this problem.
> >
> > Paul
> >
> > [1] e.g. int gcc __attribute__((vector_size(8))); v.s. int32x2_t
> > eabi;
>
> I think int32x2_t should not be a GCC vector type (thus not have a
> vector mode).  The ABI specified types should map to an integer mode
> of the right size instead.  The vectorizer would then still use
> internal GCC vector types and modes and the backend needs to provide
> instruction patterns that do the right thing with the element
> ordering the vectorizer expects.
>
> How are the int32x2_t types used?  I suppose they are arguments to
> the intrinsics.  Which means that for _most_ operations element order
> does not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply
> use the equivalent of return (int32x2_t)((gcc_int32x2_t)x +
> (gcc_int32x2_t)y).  In intrinsics where order matters you'd insert
> appropriate __builtin_shuffle()s.

Maybe there's no need to interpret the vector layout for any of the
intrinsics -- just treat all inputs & outputs as opaque (there are
intrinsics for getting/setting lanes -- IMO these shouldn't attempt to
convert lane numbers at all, though they do at present).  Several
intrinsics are currently implemented using __builtin_shuffle, e.g.:

__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vrev64_s8 (int8x8_t __a)
{
  return (int8x8_t) __builtin_shuffle (__a, (uint8x8_t) { 7, 6, 5, 4, 3, 2, 1, 0 });
}

I'd imagine that if int8x8_t is not an actual vector type, we could
invent extra builtins to convert them to and from such types to be able
to still do this kind of thing (in arm_neon.h, not necessarily for
direct use by users), i.e.:

typedef char gcc_int8x8_t __attribute__((vector_size(8)));

int8x8_t
vrev64_s8 (int8x8_t __a)
{
  gcc_int8x8_t tmp = __builtin_neon2generic (__a);
  tmp = __builtin_shuffle (tmp, (gcc_int8x8_t) { 7, 6, 5, 4, ... });
  return __builtin_generic2neon (tmp);
}

(On re-reading, that's basically the same as what you suggested, I
think.)

> Oh, of course do the above only for big-endian mode ...
>
> The other way around, mapping intrinsics and ABI vectors to vector
> modes will have issues ... you'd have to guard all optab queries in
> the middle-end to fail for arm big-endian as they expect instruction
> patterns that deal with the GCC vector ordering.
>
> Thus: model the backend after GCC's expectations and "fixup" the rest
> by fixing the ABI types and intrinsics.

I think this plan will work fine -- it has the added advantage (which
looks like a disadvantage, but really isn't) that generic vector
operations like:

void foo (void)
{
  int8x8_t x = { 0, 1, 2, 3, 4, 5, 6, 7 };
}

will *not* work -- nor will e.g. subscripting ABI-defined vectors using
[]s.  At the moment, using these features can lead to surprising
results.

Unfortunately NEON's pretty complicated, and the ARM backend currently
uses vector modes quite heavily to implement it, so just using integer
modes for intrinsics is going to be tough.  It might work to create a
shadow set of vector modes for use only by the intrinsics (O*mode for
"opaque" instead of V*mode, say), if the middle end won't barf at that.

Thanks,

Julian
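[Editorial note on the `__builtin_shuffle` semantics used above: result
lane i takes input lane mask[i], so the { 7, 6, ..., 0 } mask is a full lane
reversal.  A plain-C model of that semantics -- not the GCC builtin itself,
just an illustration of what it computes:]

```c
#include <stdint.h>

/* Model of __builtin_shuffle's one-input form: out[i] = in[mask[i]].
   With mask { 7, 6, 5, 4, 3, 2, 1, 0 } this reverses the 8 lanes, which is
   the operation vrev64_s8 performs on a D register.  */
static void
shuffle8 (const int8_t in[8], const uint8_t mask[8], int8_t out[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = in[mask[i]];
}
```

This is exactly why lane numbering matters for these intrinsics: the mask
indices are lane numbers, so any disagreement about lane order between the
ABI view and the vectorizer view changes the result.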
Julian Brown wrote:
> On Tue, 5 Mar 2013 10:42:59 +0100
> Richard Biener <richard.guenther@gmail.com> wrote:
>
>> On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com>
>> wrote:
>>>> I somehow missed the "Appendix A: Support for Advanced SIMD
>>>> Extensions" in the AAPCS document (it's not in the TOC!).  It looks
>>>> like the builtin vector types are indeed defined to be stored in
>>>> memory in vldm/vstm order -- I think that means we're back to
>>>> square one.
>>> There's still the possibility of making gcc "generic" vector types
>>> different from the ABI specified types[1], but that feels like it's
>>> probably a really bad idea.
>>>
>>> Having a distinct set of types just for the vectorizer may be a
>>> more viable option.  IIRC the type selection hooks are more flexible
>>> than when we first looked at this problem.
>>>
>>> Paul
>>>
>>> [1] e.g. int gcc __attribute__((vector_size(8))); v.s. int32x2_t
>>> eabi;
>> I think int32x2_t should not be a GCC vector type (thus not have a
>> vector mode).  The ABI specified types should map to an integer mode
>> of the right size instead.  The vectorizer would then still use
>> internal GCC vector types and modes and the backend needs to provide
>> instruction patterns that do the right thing with the element
>> ordering the vectorizer expects.
>>
>> How are the int32x2_t types used?  I suppose they are arguments to
>> the intrinsics.  Which means that for _most_ operations element order
>> does not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply
>> use the equivalent of return (int32x2_t)((gcc_int32x2_t)x +
>> (gcc_int32x2_t)y).  In intrinsics where order matters you'd insert
>> appropriate __builtin_shuffle()s.
>
> Maybe there's no need to interpret the vector layout for any of the
> intrinsics -- just treat all inputs & outputs as opaque (there are
> intrinsics for getting/setting lanes -- IMO these shouldn't attempt to
> convert lane numbers at all, though they do at present).  Several
> intrinsics are currently implemented using __builtin_shuffle, e.g.:
>
> __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
> vrev64_s8 (int8x8_t __a)
> {
>   return (int8x8_t) __builtin_shuffle (__a, (uint8x8_t) { 7, 6, 5, 4, 3, 2, 1, 0 });
> }
>
> I'd imagine that if int8x8_t is not an actual vector type, we could
> invent extra builtins to convert them to and from such types to be able
> to still do this kind of thing (in arm_neon.h, not necessarily for
> direct use by users), i.e.:
>
> typedef char gcc_int8x8_t __attribute__((vector_size(8)));
>
> int8x8_t
> vrev64_s8 (int8x8_t __a)
> {
>   gcc_int8x8_t tmp = __builtin_neon2generic (__a);
>   tmp = __builtin_shuffle (tmp, (gcc_int8x8_t) { 7, 6, 5, 4, ... });
>   return __builtin_generic2neon (tmp);
> }
>
> (On re-reading, that's basically the same as what you suggested, I
> think.)
>
>> Oh, of course do the above only for big-endian mode ...
>>
>> The other way around, mapping intrinsics and ABI vectors to vector
>> modes will have issues ... you'd have to guard all optab queries in
>> the middle-end to fail for arm big-endian as they expect instruction
>> patterns that deal with the GCC vector ordering.
>>
>> Thus: model the backend after GCC's expectations and "fixup" the rest
>> by fixing the ABI types and intrinsics.
>
> I think this plan will work fine -- it has the added advantage (which
> looks like a disadvantage, but really isn't) that generic vector
> operations like:
>
> void foo (void)
> {
>   int8x8_t x = { 0, 1, 2, 3, 4, 5, 6, 7 };
> }
>
> will *not* work -- nor will e.g. subscripting ABI-defined vectors using
> []s.  At the moment using these features can lead to surprising results.
>
> Unfortunately NEON's pretty complicated, and the ARM backend currently
> uses vector modes quite heavily implementing it, so just using integer
> modes for intrinsics is going to be tough.  It might work to create a
> shadow set of vector modes for use only by the intrinsics (O*mode for
> "opaque" instead of V*mode, say), if the middle end won't barf at that.

I suspect the mid-end may not be too happy with opaque modes for
vectors.  I've faced some issues in the past while experimenting with
large int modes for vector register lists, while implementing permuted
loads in AArch64 -- particularly in the area of subreg generation, where
SUBREG_BYTE is generated based on BITS_PER_WORD for all INT mode
classes, not taking into account which registers the values of the
particular mode end up in.  This causes subreg_bytes to be unaligned to
the vector register boundary.

To illustrate this, here is an example that exposed this issue.  For
aarch64, I mirrored the approach that the arm/thumb backend employs and
defined 'large int' opaque modes to represent the register lists,
i.e. OImode, CImode and XImode, and defined the standard patterns that
implement permuted load/stores --
vec_store_lanes<INT_MODE><VEC_MODE> and
vec_load_lanes<INT_MODE><VEC_MODE>.  At the time, I remember this test
case:

typedef unsigned short V __attribute__((vector_size(32)));
typedef V VI;

V in = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
VI mask = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, };
V out = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

extern void bar(V);

int main()
{
  V r = __builtin_shuffle(in, mask);
  bar (r);
}

generated this RTL with my experimental compiler:

...
(insn 65 59 61 2 (set (reg:DI 178)
        (and:DI (ashift:DI (subreg:DI (reg:OI 74 [ mask.3 ]) 8)
                (const_int 1 [0x1]))
            (const_int 30 [0x1e]))) vs.c:24 380 {*andim_ashiftdi_bfiz}
     (nil))
...
(insn 151 145 147 2 (set (reg:DI 256)
        (and:DI (ashift:DI (subreg:DI (reg:OI 74 [ mask.3 ]) 24)
                (const_int 1 [0x1]))
            (const_int 30 [0x1e]))) vs.c:24 380 {*andim_ashiftdi_bfiz}
     (nil))
...

which is the short-value extraction out of the vectors.  I ran into
this situation where the subregs were generated with byte offsets such
that byte_offset % UNITS_PER_VREG != 0, i.e. subreg offsets that were
not aligned to the vector register boundary.  The above dump is from
before the reload phase.  During reload subreg elimination, these
subregs were converted to refer to the incorrect part of the vector
registers.

Though OImode is a large INT mode, we force these modes to live only in
FPSIMD registers, for which UNITS_PER_VREG and BITS_PER_WORD differ
from the integer word size, i.e. UNITS_PER_VREG is 16 and BITS_PER_WORD
for FPSIMD is 128.  I discovered that in the mid-end, subregs were
generated using BITS_PER_WORD, and there weren't checks during
generation to see that BITS_PER_WORD could be dependent on the mode for
which the subreg is being generated.  There was an assumption that
BITS_PER_WORD applied to all INT modes.  In this case, because OImode
was only allowed in FPSIMD regs, BITS_PER_WORD should've been 128 --
in other words, mode-dependent.  In general, shouldn't BITS_PER_WORD be
dependent on the registers that a particular mode ultimately ends up
in, as dictated by the target hook HARD_REGNO_MODE_OK?  As far as I can
see, expmed.c:store_bit_field_1 () hasn't changed much in this respect,
and I suspect this issue still remains.

We don't have the same issue in the ARM backend because the basic unit
of register allocation is 32 bits for both the FP and integer units
(arm.h, #define ARM_NUM_INTS) and the FP unit is a register-packing
architecture.

That was in the context of register lists where large opaque int modes
represent more than one vector register.  As you suggest, if we extend
opaque int modes to represent one vector register, then with SUBREG
being generated independent of modes in the mid-end, I imagine this may
cause pain for later phases (like reload subreg elimination).
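[Editorial note: the offset arithmetic described above can be checked
numerically.  This is only a model of the mismatch, with the constants
taken from the description -- 8-byte BITS_PER_WORD steps against a 16-byte
vector register:]

```c
#include <assert.h>

/* Numeric model of the SUBREG_BYTE problem: splitting a 32-byte OImode
   value in BITS_PER_WORD-sized (64-bit, i.e. 8-byte) steps yields subreg
   byte offsets 0, 8, 16, 24, but only multiples of UNITS_PER_VREG
   (16 bytes) name a whole FPSIMD register.  Offsets 8 and 24 -- the ones
   visible in the RTL dump above -- land in the middle of a register.  */
enum { WORD_BYTES = 8, UNITS_PER_VREG = 16, OI_BYTES = 32 };

static int
vreg_aligned (int byte_offset)
{
  return byte_offset % UNITS_PER_VREG == 0;
}
```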
But that said, I'm not an expert on how the mid-end handles opaque int
modes, and things might have improved in the area of SUBREG generation
since my experiments.

Thanks,
Tejas Belagod
ARM.
Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
@@ -79,6 +79,6 @@ int main ()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-1.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-1.c	(working copy)
@@ -86,5 +86,5 @@ foo (int n)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 6 loops" 1 "vect" { target vect_strided2 } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail vect_strided2 } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail { vect_strided2 || { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/slp-cond-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-cond-4.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/slp-cond-4.c	(working copy)
@@ -82,5 +82,5 @@ int main ()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-1-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-1-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-1-big-array.c	(working copy)
@@ -86,5 +86,5 @@ foo (int n)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 6 loops" 1 "vect" { target vect_strided2 } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail vect_strided2 } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail { vect_strided2 || { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-35.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-35.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-35.c	(working copy)
@@ -45,6 +45,6 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { ia64-*-* sparc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { { ia64-*-* sparc*-*-* } || { ! vect_unpack } } } } } */
 /* { dg-final { scan-tree-dump "can't determine dependence between" "vect" } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c	(working copy)
@@ -59,6 +59,6 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-u16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-u16.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-u16.c	(working copy)
@@ -53,6 +53,6 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 1 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_widen_shift } && { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/bb-slp-26.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/bb-slp-26.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/bb-slp-26.c	(working copy)
@@ -55,6 +55,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 } } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "slp" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c	(working copy)
@@ -62,6 +62,6 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { { ! vect_sizes_32B_16B } && { ! vect_widen_shift } } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 8 "vect" { target vect_sizes_32B_16B } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-35-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-35-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-35-big-array.c	(working copy)
@@ -45,6 +45,6 @@ int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { ia64-*-* sparc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { { ia64-*-* sparc*-*-* } || { ! vect_unpack } } } } } */
 /* { dg-final { scan-tree-dump-times "can't determine dependence between" 1 "vect" } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c	(working copy)
@@ -60,6 +60,6 @@ int main (void)
 
 /* Final value stays in int, so no over-widening is detected at the moment.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 0 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/pr43430-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr43430-2.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/pr43430-2.c	(working copy)
@@ -12,5 +12,5 @@ vsad16_c (void *c, uint8_t * s1, uint8_t
   return score;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_condition } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_condition && vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-s8.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-s8.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-s8.c	(working copy)
@@ -53,6 +53,6 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 1 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_widen_shift } && { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c	(working copy)
@@ -61,6 +61,6 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { ! vect_widen_shift } } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c	(working copy)
@@ -59,6 +59,6 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump "vect_recog_over_widening_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c	(working copy)
@@ -66,6 +66,6 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { { ! vect_sizes_32B_16B } && { ! vect_widen_shift } } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 8 "vect" { target vect_sizes_32B_16B } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c	(working copy)
@@ -65,6 +65,6 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { ! vect_widen_shift } } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-93.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-93.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-93.c	(working copy)
@@ -79,7 +79,7 @@ int main (void)
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target vect_no_align } } } */
 
 /* in main: */
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target vect_no_align } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { vect_no_align && { ! vect_intfloat_cvt } } } } } */
 /* { dg-final { scan-tree-dump-times "Vectorizing an unaligned access" 1 "vect" { xfail { vect_no_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-u8.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-u8.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-u8.c	(working copy)
@@ -60,5 +60,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_widen_shift } && { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4a.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4a.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4a.c	(working copy)
@@ -35,5 +35,5 @@ int main (void)
   return main1 ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_intfloat_cvt } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_intfloat_cvt && vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4b.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4b.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4b.c	(working copy)
@@ -35,5 +35,5 @@ int main (void)
   return main1 ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_intfloat_cvt } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_intfloat_cvt && vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-2-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-2-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-2-big-array.c	(working copy)
@@ -60,6 +60,6 @@ int main (void)
 
 /* Final value stays in int, so no over-widening is detected at the moment.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 0 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/bb-slp-11.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/bb-slp-11.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/bb-slp-11.c	(working copy)
@@ -48,6 +48,6 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 } } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "slp" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-s16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-s16.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-s16.c	(working copy)
@@ -102,6 +102,6 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 8 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { ! vect_widen_shift } && { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/lib/target-supports.exp
===================================================================
--- gcc/testsuite/lib/target-supports.exp	(revision 196170)
+++ gcc/testsuite/lib/target-supports.exp	(working copy)
@@ -3089,7 +3089,8 @@ proc check_effective_target_vect_perm {
 	verbose "check_effective_target_vect_perm: using cached result" 2
     } else {
 	set et_vect_perm_saved 0
-	if { [is-effective-target arm_neon_ok]
+	if { ([is-effective-target arm_neon_ok]
+	      && [is-effective-target arm_little_endian])
 	     || [istarget aarch64*-*-*]
 	     || [istarget powerpc*-*-*]
 	     || [istarget spu-*-*]
@@ -3211,7 +3212,8 @@ proc check_effective_target_vect_widen_s
     } else {
 	set et_vect_widen_sum_qi_to_hi_saved 0
 	if { [check_effective_target_vect_unpack]
-	     || [check_effective_target_arm_neon_ok]
+	     || ([check_effective_target_arm_neon_ok]
+		 && [check_effective_target_arm_little_endian])
 	     || [istarget ia64-*-*] } {
 	    set et_vect_widen_sum_qi_to_hi_saved 1
 	}
@@ -3263,7 +3265,8 @@ proc check_effective_target_vect_widen_m
 	}
 	if { [istarget powerpc*-*-*]
	     || [istarget aarch64*-*-*]
-	     || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]) } {
+	     || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+		 && [check_effective_target_arm_little_endian]) } {
	    set et_vect_widen_mult_qi_to_hi_saved 1
	}
     }
@@ -3298,7 +3301,8 @@ proc check_effective_target_vect_widen_m
	     || [istarget aarch64*-*-*]
	     || [istarget i?86-*-*]
	     || [istarget x86_64-*-*]
-	     || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]) } {
+	     || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+		 && [check_effective_target_arm_little_endian]) } {
	    set et_vect_widen_mult_hi_to_si_saved 1
	}
     }
@@ -3368,7 +3372,8 @@ proc check_effective_target_vect_widen_s
	verbose "check_effective_target_vect_widen_shift: using cached result" 2
     } else {
	set et_vect_widen_shift_saved 0
-	if { ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]) } {
+	if { ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+	      && [check_effective_target_arm_little_endian]) } {
	    set et_vect_widen_shift_saved 1
	}
     }
@@ -3859,7 +3864,8 @@ proc check_effective_target_vect_extract
	set et_vect_extract_even_odd_saved 0
	if { [istarget aarch64*-*-*]
	     || [istarget powerpc*-*-*]
-	     || [is-effective-target arm_neon_ok]
+	     || ([is-effective-target arm_neon_ok]
+		 && [is-effective-target arm_little_endian])
	     || [istarget i?86-*-*]
	     || [istarget x86_64-*-*]
	     || [istarget ia64-*-*]
@@ -3885,7 +3891,8 @@ proc check_effective_target_vect_interle
	set et_vect_interleave_saved 0
	if { [istarget aarch64*-*-*]
	     || [istarget powerpc*-*-*]
-	     || [is-effective-target arm_neon_ok]
+	     || ([is-effective-target arm_neon_ok]
+		 && [is-effective-target arm_little_endian])
	     || [istarget i?86-*-*]
	     || [istarget x86_64-*-*]
	     || [istarget ia64-*-*]
@@ -3915,7 +3922,8 @@ foreach N {2 3 4 8} {
	     && [check_effective_target_vect_extract_even_odd] } {
	    set et_vect_stridedN_saved 1
	}
-	if { ([istarget arm*-*-*]
+	if { (([istarget arm*-*-*] && [is-effective-target arm_neon_ok]
+	       && [is-effective-target arm_little_endian])
	      || [istarget aarch64*-*-*])
	     && N >= 2 && N <= 4 } {
	    set et_vect_stridedN_saved 1
	}
@@ -3934,7 +3942,8 @@ proc check_effective_target_vect_multipl
	set et_vect_multiple_sizes_saved 0
	if { ([istarget aarch64*-*-*]
-	      || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok])) } {
+	      || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+		  && [check_effective_target_arm_little_endian])) } {
	    set et_vect_multiple_sizes_saved 1
	}
	if { ([istarget x86_64-*-*] || [istarget i?86-*-*]) } {
@@ -3957,8 +3966,7 @@ proc check_effective_target_vect64 { } {
     } else {
	set et_vect64_saved 0
	if { ([istarget arm*-*-*]
-	      && [check_effective_target_arm_neon_ok]
-	      && [check_effective_target_arm_little_endian]) } {
+	      && [check_effective_target_arm_neon_ok]) } {
	    set et_vect64_saved 1
	}
     }
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 196170)
+++ gcc/config/arm/arm.c	(working copy)
@@ -25041,7 +25041,7 @@ static bool
 arm_array_mode_supported_p (enum machine_mode mode,
			     unsigned HOST_WIDE_INT nelems)
 {
-  if (TARGET_NEON
+  if (TARGET_NEON && !BYTES_BIG_ENDIAN
       && (VALID_NEON_DREG_MODE (mode) || VALID_NEON_QREG_MODE (mode))
       && (nelems >= 2 && nelems <= 4))
     return true;
@@ -25057,23 +25057,27 @@ static enum machine_mode
 arm_preferred_simd_mode (enum machine_mode mode)
 {
   if (TARGET_NEON)
-    switch (mode)
-      {
-      case SFmode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V2SFmode : V4SFmode;
-      case SImode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V2SImode : V4SImode;
-      case HImode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V4HImode : V8HImode;
-      case QImode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V8QImode : V16QImode;
-      case DImode:
-	if (!TARGET_NEON_VECTORIZE_DOUBLE)
-	  return V2DImode;
-	break;
+    {
+      bool double_only = BYTES_BIG_ENDIAN || TARGET_NEON_VECTORIZE_DOUBLE;
 
-      default:;
-      }
+      switch (mode)
+	{
+	case SFmode:
+	  return double_only ? V2SFmode : V4SFmode;
+	case SImode:
+	  return double_only ? V2SImode : V4SImode;
+	case HImode:
+	  return double_only ? V4HImode : V8HImode;
+	case QImode:
+	  return double_only ? V8QImode : V16QImode;
+	case DImode:
+	  if (!double_only)
+	    return V2DImode;
+	  break;
+
+	default:;
+	}
+    }
 
   if (TARGET_REALLY_IWMMXT)
     switch (mode)
@@ -25974,6 +25978,11 @@ arm_vector_alignment (const_tree type)
 static unsigned int
 arm_autovectorize_vector_sizes (void)
 {
+  /* Use of quad-word registers for autovectorization for NEON is fraught with
+     difficulties.  Just don't do that.  */
+  if (TARGET_NEON && BYTES_BIG_ENDIAN)
+    return 8;
+
   return TARGET_NEON_VECTORIZE_DOUBLE ? 0 : (16 | 8);
 }
@@ -27008,6 +27017,12 @@ arm_vectorize_vec_perm_const_ok (enum ma
   unsigned int i, nelt, which;
   bool ret;
 
+  /* FIXME: There appear to be element-numbering problems with vector
+     permutations in big-endian mode that cause the vectorizer to produce bad
+     code.  Disable for now.  */
+  if (BYTES_BIG_ENDIAN)
+    return false;
+
   d.vmode = vmode;
   d.nelt = nelt = GET_MODE_NUNITS (d.vmode);
   d.testing_p = true;
Index: gcc/config/arm/neon.md
===================================================================
--- gcc/config/arm/neon.md	(revision 196170)
+++ gcc/config/arm/neon.md	(working copy)
@@ -4506,7 +4506,7 @@
   [(set (match_operand:VDQX 0 "s_register_operand")
	(unspec:VDQX [(match_operand:VDQX 1 "neon_struct_operand")]
		     UNSPEC_VLD1))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld1<mode>"
   [(set (match_operand:VDQX 0 "s_register_operand" "=w")
@@ -4618,7 +4618,7 @@
   [(set (match_operand:VDQX 0 "neon_struct_operand")
	(unspec:VDQX [(match_operand:VDQX 1 "s_register_operand")]
		     UNSPEC_VST1))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst1<mode>"
   [(set (match_operand:VDQX 0 "neon_struct_operand" "=Um")
@@ -4683,7 +4683,7 @@
	(unspec:TI [(match_operand:TI 1 "neon_struct_operand")
		    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VLD2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld2<mode>"
   [(set (match_operand:TI 0 "s_register_operand" "=w")
@@ -4708,7 +4708,7 @@
	(unspec:OI [(match_operand:OI 1 "neon_struct_operand")
		    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VLD2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld2<mode>"
   [(set (match_operand:OI 0 "s_register_operand" "=w")
@@ -4797,7 +4797,7 @@
	(unspec:TI [(match_operand:TI 1 "s_register_operand")
		    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VST2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst2<mode>"
   [(set (match_operand:TI 0 "neon_struct_operand" "=Um")
@@ -4822,7 +4822,7 @@
	(unspec:OI [(match_operand:OI 1 "s_register_operand")
		    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VST2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst2<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
@@ -4894,7 +4894,7 @@
	(unspec:EI [(match_operand:EI 1 "neon_struct_operand")
		    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VLD3))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld3<mode>"
   [(set (match_operand:EI 0 "s_register_operand" "=w")
@@ -4918,7 +4918,7 @@
   [(match_operand:CI 0 "s_register_operand")
    (match_operand:CI 1 "neon_struct_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vld3<mode> (operands[0], operands[1]));
   DONE;
@@ -5068,7 +5068,7 @@
	(unspec:EI [(match_operand:EI 1 "s_register_operand")
		    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VST3))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst3<mode>"
   [(set (match_operand:EI 0 "neon_struct_operand" "=Um")
@@ -5091,7 +5091,7 @@
   [(match_operand:CI 0 "neon_struct_operand")
    (match_operand:CI 1 "s_register_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vst3<mode> (operands[0], operands[1]));
   DONE;
@@ -5213,7 +5213,7 @@
	(unspec:OI [(match_operand:OI 1 "neon_struct_operand")
		    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
		   UNSPEC_VLD4))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld4<mode>"
   [(set (match_operand:OI 0 "s_register_operand" "=w")
@@ -5237,7 +5237,7 @@
   [(match_operand:XI 0 "s_register_operand")
    (match_operand:XI 1 "neon_struct_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vld4<mode> (operands[0], operands[1]));
   DONE;
@@ -5394,7 
+5394,7 @@ (unspec:OI [(match_operand:OI 1 "s_register_operand") (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] UNSPEC_VST4))] - "TARGET_NEON") + "TARGET_NEON && !BYTES_BIG_ENDIAN") (define_insn "neon_vst4<mode>" [(set (match_operand:OI 0 "neon_struct_operand" "=Um") @@ -5418,7 +5418,7 @@ [(match_operand:XI 0 "neon_struct_operand") (match_operand:XI 1 "s_register_operand") (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { emit_insn (gen_neon_vst4<mode> (operands[0], operands[1])); DONE; @@ -5725,7 +5725,7 @@ [(set (match_operand:<V_widen> 0 "register_operand" "=w") (SE:<V_widen> (ashift:VW (match_operand:VW 1 "register_operand" "w") (match_operand:<V_innermode> 2 "const_neon_scalar_shift_amount_operand" ""))))] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { return "vshll.<US><V_sz_elem> %q0, %P1, %2"; } @@ -5771,7 +5771,7 @@ (define_expand "vec_unpack<US>_lo_<mode>" [(match_operand:<V_double_width> 0 "register_operand" "") (SE:<V_double_width>(match_operand:VDI 1 "register_operand"))] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { rtx tmpreg = gen_reg_rtx (<V_widen>mode); emit_insn (gen_neon_unpack<US>_<mode> (tmpreg, operands[1])); @@ -5784,7 +5784,7 @@ (define_expand "vec_unpack<US>_hi_<mode>" [(match_operand:<V_double_width> 0 "register_operand" "") (SE:<V_double_width>(match_operand:VDI 1 "register_operand"))] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { rtx tmpreg = gen_reg_rtx (<V_widen>mode); emit_insn (gen_neon_unpack<US>_<mode> (tmpreg, operands[1])); @@ -5800,7 +5800,7 @@ (match_operand:VDI 1 "register_operand" "w")) (SE:<V_widen> (match_operand:VDI 2 "register_operand" "w"))))] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" "vmull.<US><V_sz_elem> %q0, %P1, %P2" [(set_attr "neon_type" "neon_shift_1")] ) @@ -5809,7 +5809,7 @@ [(match_operand:<V_double_width> 0 "register_operand" "") (SE:<V_double_width> (match_operand:VDI 1 "register_operand" "")) (SE:<V_double_width> 
(match_operand:VDI 2 "register_operand" ""))] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { rtx tmpreg = gen_reg_rtx (<V_widen>mode); emit_insn (gen_neon_vec_<US>mult_<mode> (tmpreg, operands[1], operands[2])); @@ -5824,7 +5824,7 @@ [(match_operand:<V_double_width> 0 "register_operand" "") (SE:<V_double_width> (match_operand:VDI 1 "register_operand" "")) (SE:<V_double_width> (match_operand:VDI 2 "register_operand" ""))] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { rtx tmpreg = gen_reg_rtx (<V_widen>mode); emit_insn (gen_neon_vec_<US>mult_<mode> (tmpreg, operands[1], operands[2])); @@ -5839,7 +5839,7 @@ [(match_operand:<V_double_width> 0 "register_operand" "") (SE:<V_double_width> (match_operand:VDI 1 "register_operand" "")) (match_operand:SI 2 "immediate_operand" "i")] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { rtx tmpreg = gen_reg_rtx (<V_widen>mode); emit_insn (gen_neon_vec_<US>shiftl_<mode> (tmpreg, operands[1], operands[2])); @@ -5853,7 +5853,7 @@ [(match_operand:<V_double_width> 0 "register_operand" "") (SE:<V_double_width> (match_operand:VDI 1 "register_operand" "")) (match_operand:SI 2 "immediate_operand" "i")] - "TARGET_NEON" + "TARGET_NEON && !BYTES_BIG_ENDIAN" { rtx tmpreg = gen_reg_rtx (<V_widen>mode); emit_insn (gen_neon_vec_<US>shiftl_<mode> (tmpreg, operands[1], operands[2]));