Message ID | 85da45b3271492b67b7f2a6f9474ce7a153e200c.1536144068.git.ams@codesourcery.com
---|---
State | New
Series | AMD GCN Port
<ams@codesourcery.com> writes:
> This feature probably ought to be reworked as a proper target hook, but I
> would like to know if this is the correct solution to the problem first.
>
> The problem is that GCN vectors have a fixed number of elements (64) and
> the vector size varies with element size. E.g. V64QI is 64 bytes and
> V64SI is 256 bytes.
>
> This is a problem because GCC has an assumption that a) vector registers
> are fixed size, and b) if there are multiple vector sizes you want to pick
> one size and stick with it for the whole function.

The whole of the current vectorisation region rather than the whole
function, but yeah, this is a fundamental assumption with the current
autovec code. It's something that would be really good to fix...

> This is a problem in various places, but mostly it's not fatal. However,
> get_vectype_for_scalar_type caches the vector size for the first type it
> encounters and then tries to apply that to all subsequent vectors, which
> completely destroys vectorization. The caching feature appears to be an
> attempt to cope with AVX having a different vector size to other x86
> vector options.
>
> This patch simply disables the cache so that it must ask the backend for
> the preferred mode for every type.

TBH I'm surprised this works. Obviously it does, otherwise you wouldn't
have posted it, but it seems like an accident. Various parts of the
vectoriser query current_vector_size and expect it to be stable for the
current choice of vector size.

The underlying problem also affects (at least) base AArch64, SVE and
x86_64. We try to choose vector types on the fly based only on the type
of a given scalar value, but in reality, the type we want for a 32-bit
element (say) often depends on whether the vectorisation region also has
smaller or larger elements. And in general we only know that after
vect_mark_stmts_to_be_vectorized, but we want to know the vector types
earlier, such as in pattern recognition and while building SLP trees.
It's a bit of a chicken-and-egg problem...

Richard
On 17/09/18 20:28, Richard Sandiford wrote:
>> This patch simply disables the cache so that it must ask the backend for
>> the preferred mode for every type.
>
> TBH I'm surprised this works. Obviously it does, otherwise you wouldn't
> have posted it, but it seems like an accident. Various parts of the
> vectoriser query current_vector_size and expect it to be stable for
> the current choice of vector size.

Indeed, this is why this remains only a half-baked patch: I wasn't
confident it was the correct or whole solution.

It works insofar as it fixes the immediate problem that I saw -- "no
vector type" -- and makes a bunch of vect.exp testcases happy.

It's quite possible that something else is unhappy with this.

> The underlying problem also affects (at least) base AArch64, SVE and
> x86_64. We try to choose vector types on the fly based only on the type
> of a given scalar value, but in reality, the type we want for a 32-bit
> element (say) often depends on whether the vectorisation region also has
> smaller or larger elements. And in general we only know that after
> vect_mark_stmts_to_be_vectorized, but we want to know the vector types
> earlier, such as in pattern recognition and while building SLP trees.
> It's a bit of a chicken-and-egg problem...

I don't understand why the number of bits in a vector is the key
information here? It would make sense if you were to say that the number
of elements has to be fixed in a given region, because obviously that's
tied to loop strides and such, but why the size?

It seems like there is an architecture where you don't want to mix
instruction types (SSE vs. AVX?) and that makes sense for that
architecture, but if that's the case then we need to be able to turn it
off for other architectures.

For GCN, vectors are fully maskable, so we almost want such
considerations to be completely ignored. We basically want it to act
like it can have any size vector it likes, up to 64 elements.

Andrew
Andrew Stubbs <ams@codesourcery.com> writes:
> On 17/09/18 20:28, Richard Sandiford wrote:
>>> This patch simply disables the cache so that it must ask the backend for
>>> the preferred mode for every type.
>>
>> TBH I'm surprised this works. Obviously it does, otherwise you wouldn't
>> have posted it, but it seems like an accident. Various parts of the
>> vectoriser query current_vector_size and expect it to be stable for
>> the current choice of vector size.
>
> Indeed, this is why this remains only a half-baked patch: I wasn't
> confident it was the correct or whole solution.
>
> It works insofar as it fixes the immediate problem that I saw -- "no
> vector type" -- and makes a bunch of vect.exp testcases happy.
>
> It's quite possible that something else is unhappy with this.
>
>> The underlying problem also affects (at least) base AArch64, SVE and
>> x86_64. We try to choose vector types on the fly based only on the type
>> of a given scalar value, but in reality, the type we want for a 32-bit
>> element (say) often depends on whether the vectorisation region also has
>> smaller or larger elements. And in general we only know that after
>> vect_mark_stmts_to_be_vectorized, but we want to know the vector types
>> earlier, such as in pattern recognition and while building SLP trees.
>> It's a bit of a chicken-and-egg problem...
>
> I don't understand why the number of bits in a vector is the key
> information here?

Arguably it shouldn't be, and it's really just a proxy for the vector
(sub)architecture. But this is "should be" vs. "is" :-)

> It would make sense if you were to say that the number of elements has
> to be fixed in a given region, because obviously that's tied to loop
> strides and such, but why the size?
>
> It seems like there is an architecture where you don't want to mix
> instruction types (SSE vs. AVX?) and that makes sense for that
> architecture, but if that's the case then we need to be able to turn it
> off for other architectures.
It's not about trying to avoid mixing vector sizes: from what Jakub said
earlier in the year, even x86 wants to do that (but can't yet). The idea
is instead to try the available possibilities. E.g. for AArch64 we want
to try SVE, 128-bit Advanced SIMD and 64-bit Advanced SIMD.

With something like:

    int *ip;
    short *sp;
    for (int i = 0; i < n; ++i)
      ip[i] = sp[i];

there are three valid choices for Advanced SIMD:

(1) use 1 128-bit vector of sp and 2 128-bit vectors of ip
(2) use 1 64-bit vector of sp and 2 64-bit vectors of ip
(3) use 1 64-bit vector of sp and 1 128-bit vector of ip

At the moment we only try (1) and (2), but in practice, (3) should be
better than (2) in most cases. I guess in some ways trying all three
would be best, but if we only try two, trying (1) and (3) is better
than trying (1) and (2).

For:

    for (int i = 0; i < n; ++i)
      ip[i] += 1;

there are two valid choices for Advanced SIMD:

(4) use 1 128-bit vector of ip
(5) use 1 64-bit vector of ip

The problem for the current autovec set-up is that the ip type for
64-bit Advanced SIMD varies between (3) and (5): for (3) it's a 128-bit
vector type and for (5) it's a 64-bit vector type. So the type we want
for a given vector subarchitecture is partly determined by the other
types in the region: it isn't simply a function of the subarchitecture
and the element type. This is why the current autovec code only
supports (1), (2), (4) and (5). And I think this is essentially the
same limitation that you're hitting.

> For GCN, vectors are fully maskable, so we almost want such
> considerations to be completely ignored. We basically want it to act
> like it can have any size vector it likes, up to 64 elements.

SVE is similar.
But even for SVE there's an equivalent trade-off between (1) and (3):

(1') use 1 fully-populated vector for sp and 2 fully-populated vectors
     for ip
(3') use 1 half-populated vector for sp and 1 fully-populated vector
     for ip

Which is best for more complicated examples depends on the balance
between ip-based work and sp-based work. The packing and unpacking in
(1') has a cost, but it would pay off if there was much more sp work
than ip work, since in that case (3') would spend most of its time
operating on partially-populated vectors.

Would the same be useful for GCN, or do you basically always want a VF
of 64?

None of this is a fundamental restriction in theory. It's just
something that needs to be fixed.

One approach would be to get the loop vectoriser to iterate over the
number of lanes the target supports instead of all possible vector
sizes. The problem is that on its own this would mean trying 4 lane
counts even on targets with a single supported vector size. So we'd
need to do something a bit smarter...

Richard
On 18/09/18 12:21, Richard Sandiford wrote:
> Would the same be useful for GCN, or do you basically always
> want a VF of 64?

Always 64; the vector size varies between 512-bit and 4096-bit, as
needed.

> None of this is a fundamental restriction in theory. It's just
> something that needs to be fixed.
>
> One approach would be to get the loop vectoriser to iterate over the
> number of lanes the target supports instead of all possible vector
> sizes. The problem is that on its own this would mean trying 4
> lane counts even on targets with a single supported vector size.
> So we'd need to do something a bit smarter...

Yeah, that sounds like an interesting project, but way more than I
think I need. Basically, we don't need to iterate over anything;
there's only one option for each mode.

For the purposes of this patch, might it be enough to track down all
the places that use the current_vector_size and fix them up, somehow?

Obviously, I'm not sure what that means just yet ...

Andrew
On Tue, Sep 18, 2018 at 10:22 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 18/09/18 12:21, Richard Sandiford wrote:
> > Would the same be useful for GCN, or do you basically always
> > want a VF of 64?
>
> Always 64; the vector size varies between 512-bit and 4096-bit, as
> needed.
>
> > None of this is a fundamental restriction in theory. It's just
> > something that needs to be fixed.
> >
> > One approach would be to get the loop vectoriser to iterate over the
> > number of lanes the target supports instead of all possible vector
> > sizes. The problem is that on its own this would mean trying 4
> > lane counts even on targets with a single supported vector size.
> > So we'd need to do something a bit smarter...
>
> Yeah, that sounds like an interesting project, but way more than I
> think I need. Basically, we don't need to iterate over anything;
> there's only one option for each mode.
>
> For the purposes of this patch, might it be enough to track down all
> the places that use the current_vector_size and fix them up, somehow?
>
> Obviously, I'm not sure what that means just yet ...

I think the only part that wants a "fixed" size is the code iterating
over vector sizes. All the rest of the code simply wants to commit to
a specific vector type for each DEF - to match the ISAs we've faced so
far the approach is simply to choose the vector type of
current_vector_size size and proper element type. I've long wanted to
fix that part in a way to actually commit to vector types later and
compute the DEF vector type of a stmt by looking at the vector type of
the USEs and the operation.

So I guess the current_vector_size thing isn't too hard to get rid of;
what you'd end up with would be using that size when you decide for
vector types for loads (where there are no USEs with vector types, so
for example this would not apply to gathers).
So I'd say you want to refactor get_same_sized_vectype uses and make
the size argument to get_vectype_for_scalar_type_and_size a hint only.

Richard.

> Andrew
On 19/09/18 14:45, Richard Biener wrote:
> So I guess the current_vector_size thing isn't too hard to get rid of;
> what you'd end up with would be using that size when you decide for
> vector types for loads (where there are no USEs with vector types, so
> for example this would not apply to gathers).

I've finally got back to looking at this ...

My patch works because current_vector_size is only referenced in two
places. One is passed to get_vectype_for_scalar_type_and_size, and that
function simply calls targetm.vectorize.preferred_simd_mode when the
requested size is zero. The other is passed to build_truth_vector_type,
which only uses it to call targetm.vectorize.get_mask_mode, and the GCN
backend ignores the size parameter because it only has one option.
Presumably other backends would object to a zero-size mask.

So, as I said originally, the effect is that leaving current_vector_size
zeroed means "always ask the backend".

Pretty much everything else chains off of those places using
get_same_sized_vectype, so ignoring current_vector_size is safe on GCN,
and might even be safe on other architectures?

> So I'd say you want to refactor get_same_sized_vectype uses and
> make the size argument to get_vectype_for_scalar_type_and_size
> a hint only.

I've looked through the uses of get_same_sized_vectype and I've come to
the conclusion that many of them really mean it.

For example, vectorizable_bswap tries to reinterpret a vector register
as a byte vector so that it can permute it. This is an optimization
that won't work on GCN (because the vector registers don't work like
that), but seems like a valid use of the vector size characteristic of
other architectures.

For another example, vectorizable_conversion is targeting the
vec_pack_trunc patterns, and therefore really does want to specify the
types. Again, this isn't something we want to do on GCN (a regular
trunc pattern with a vector mode will work fine).
However, vectorizable_operation seems to use it to try to match the
input and output types to the same vector unit (i.e. vector size); at
least that's my interpretation. It returns "not vectorizable" if the
input and output vectors have different numbers of elements. For most
operators the lhs and rhs types will be the same, so we're all good,
but I imagine that this code will prevent TRUNC being vectorized on GCN
because the "same size" vector doesn't exist, and it doesn't check if
there's a vector with the same number of elements (I've not actually
tried that, yet, and there may be extra magic elsewhere for that case,
but YSWIM).

I don't think changing this case to a new "get_same_length_vectype"
would be appropriate for many architectures, so I'm not sure what to do
here?

We could fix this with new target hooks, perhaps?

TARGET_VECTORIZE_REINTERPRET_VECTOR (vectype_in, scalartype_out)

  Returns a new vectype (or mode) that uses the same vector register as
  vectype_in, but has elements of scalartype_out.

  The default implementation would be get_same_sized_vectype.

  GCN would just return NULL, because you can't do that kind of
  optimization.

TARGET_VECTORIZE_COMPATIBLE_VECTOR (opcode, vectype_in, scalartype_out)

  Returns a new vectype (or mode) that has the right number of elements
  for the opcode (i.e. the same number, or 2x for packed opcodes), and
  elements of scalartype_out. The backend might choose a different
  vector size, but promises that hardware can do the operation (i.e.
  it's not mixing vector units).

  The default implementation would be get_same_sized_vectype, for
  backward compatibility.

  GCN would simply return V64xx according to scalartype_out, and NULL
  for unsupported opcodes.

Of course, none of this addresses the question of which vector size to
choose in the first place. I've not figured out how it might ever start
with a type other than the "preferred SIMD mode", yet.

Thoughts?

Andrew
On Fri, Sep 28, 2018 at 2:47 PM Andrew Stubbs <ams@codesourcery.com> wrote:
>
> On 19/09/18 14:45, Richard Biener wrote:
> > So I guess the current_vector_size thing isn't too hard to get rid
> > of; what you'd end up with would be using that size when you decide
> > for vector types for loads (where there are no USEs with vector
> > types, so for example this would not apply to gathers).
>
> I've finally got back to looking at this ...
>
> My patch works because current_vector_size is only referenced in two
> places. One is passed to get_vectype_for_scalar_type_and_size, and
> that function simply calls targetm.vectorize.preferred_simd_mode when
> the requested size is zero. The other is passed to
> build_truth_vector_type, which only uses it to call
> targetm.vectorize.get_mask_mode, and the GCN backend ignores the size
> parameter because it only has one option. Presumably other backends
> would object to a zero-size mask.
>
> So, as I said originally, the effect is that leaving
> current_vector_size zeroed means "always ask the backend".

Yes.

> Pretty much everything else chains off of those places using
> get_same_sized_vectype, so ignoring current_vector_size is safe on
> GCN, and might even be safe on other architectures?

Other architectures really only use it when there's a choice, like
choosing between V4SI, V8SI and V16SI on x86_64. current_vector_size
was introduced to be able to "iterate" over supported ISAs and let the
vectorizer decide which one to use in the end (SSE vs. AVX vs. AVX512).
The value of zero is simply to give the target another chance to set
its preferred value based on the first call. I'd call that a bit
awkward (*)

For architectures that only have a single "vector size" this variable
is really spurious and whether it is zero or non-zero doesn't make a
difference.
Apart from your architecture of course, where non-zero doesn't work ;)

(*) So one possibility would be to drop the special value of zero
("auto-detect") and thus not change current_vector_size in
get_vectype_for_scalar_type at all. For targets which report multiple
vector size support, set current_vector_size to the preferred one in
the loop over vector sizes, and for targets that do not, simply keep it
at zero.

> > So I'd say you want to refactor get_same_sized_vectype uses and
> > make the size argument to get_vectype_for_scalar_type_and_size
> > a hint only.
>
> I've looked through the uses of get_same_sized_vectype and I've come
> to the conclusion that many of them really mean it.
>
> For example, vectorizable_bswap tries to reinterpret a vector
> register as a byte vector so that it can permute it. This is an
> optimization that won't work on GCN (because the vector registers
> don't work like that), but seems like a valid use of the vector size
> characteristic of other architectures.

True.

> For another example, vectorizable_conversion is targeting the
> vec_pack_trunc patterns, and therefore really does want to specify
> the types. Again, this isn't something we want to do on GCN (a
> regular trunc pattern with a vector mode will work fine).
>
> However, vectorizable_operation seems to use it to try to match the
> input and output types to the same vector unit (i.e. vector size); at
> least that's my interpretation. It returns "not vectorizable" if the
> input and output vectors have different numbers of elements. For most
> operators the lhs and rhs types will be the same, so we're all good,
> but I imagine that this code will prevent TRUNC being vectorized on
> GCN because the "same size" vector doesn't exist, and it doesn't
> check if there's a vector with the same number of elements (I've not
> actually tried that, yet, and there may be extra magic elsewhere for
> that case, but YSWIM).
Yeah, we don't have a get_vector_type_for_scalar_type_and_nelems, which
would probably be semantically better in many places.

> I don't think changing this case to a new "get_same_length_vectype"
> would be appropriate for many architectures, so I'm not sure what to
> do here?
>
> We could fix this with new target hooks, perhaps?
>
> TARGET_VECTORIZE_REINTERPRET_VECTOR (vectype_in, scalartype_out)
>
>   Returns a new vectype (or mode) that uses the same vector register
>   as vectype_in, but has elements of scalartype_out.
>
>   The default implementation would be get_same_sized_vectype.
>
>   GCN would just return NULL, because you can't do that kind of
>   optimization.
>
> TARGET_VECTORIZE_COMPATIBLE_VECTOR (opcode, vectype_in, scalartype_out)
>
>   Returns a new vectype (or mode) that has the right number of
>   elements for the opcode (i.e. the same number, or 2x for packed
>   opcodes), and elements of scalartype_out. The backend might choose
>   a different vector size, but promises that hardware can do the
>   operation (i.e. it's not mixing vector units).
>
>   The default implementation would be get_same_sized_vectype, for
>   backward compatibility.
>
>   GCN would simply return V64xx according to scalartype_out, and NULL
>   for unsupported opcodes.

I don't like putting the burden on the target here too much, given that
the vectorizer should know what kind of constraints it has: it
implements the vectorization on GIMPLE, which has IL constraints that
are to be met - we just need to ask for vector types with the
appropriate constraints rather than using same-size everywhere.

> Of course, none of this addresses the question of which vector size
> to choose in the first place.

See above for a suggestion.

> I've not figured out how it might ever start
> with a type other than the "preferred SIMD mode", yet.

In practically all cases vect_analyze_data_refs calling
get_vectype_for_scalar_type on a load will be the one nailing down
current_vector_size (if zero).
I also cannot quickly think of a case where that would differ from
"preferred SIMD mode" unless the target simply lies to us here ;)

So, would a current_vector_size re-org like the one outlined above help
you? I agree leaving it at zero should work unless there's code in the
vectorizer that is simply wrong. Addressing some GCN issues with
get_vectype_for_scalar_type_and_nunits would also be OK with me (if
that works).

Thanks,
Richard.

> Thoughts?
>
> Andrew
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 607a2bd..8875201 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -9945,9 +9945,12 @@ get_vectype_for_scalar_type (tree scalar_type)
   tree vectype;
   vectype = get_vectype_for_scalar_type_and_size (scalar_type,
 						  current_vector_size);
+/* FIXME: use a proper target hook or macro.  */
+#ifndef TARGET_DISABLE_CURRENT_VECTOR_SIZE
   if (vectype && known_eq (current_vector_size, 0U))
     current_vector_size = GET_MODE_SIZE (TYPE_MODE (vectype));
+#endif
   return vectype;
 }