Message ID | ora5je918y.fsf_-_@lxoliva.fsfla.org
---|---
State | New
Series | [v3,testsuite,arm,vect] adjust mve-vshr test [PR113281]
On 21/06/2024 08:57, Alexandre Oliva wrote:
> On Jun 20, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>
>> Maybe using
>>   if ((unsigned)b[i] >= BITS) \
>> would be clearer?
>
> Heh.  Why make it simpler if we can make it unreadable, right? :-D
>
> Thanks, here's another version I've just retested on x-arm-eabi.  Ok?
>
> I'm not sure how to credit your suggestion.  It's not like you pretty
> much wrote the entire patch, as in Richard's case, but it's still a
> sizable chunk of this two-liner.  Any preferences?

How about mentioning Christophe's simplification in the commit log?

>
> The test was too optimistic, alas.  We used to vectorize shifts
> involving 8-bit and 16-bit integral types by clamping the shift count
> at the highest in-range shift count, but that was not correct: such
> narrow shifts expect integral promotion, so larger shift counts should
> be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
> before the fix).

This is OK, but you might wish to revisit this statement before
committing.  I think the above is a mis-summary of the original bug
report, which had a test to pick between 0 and 1 as the result of a
shift operation.

If I've understood what's going on here correctly, then we have

  (int16_t)32768 >> (int16_t)16

but shift is always done at int precision, so this is (due to default
promotions)

  (int)(int16_t)32768 >> 16   // size/type of the shift amount does not matter

which then simplifies to

  -32768 >> 16;   // 0xffff8000 >> 16
  = -1;

I think the original bug was that we were losing the cast to short (and
hence the sign extension of the intermediate value), so effectively we
simplified this to

  32768 >> 16;   // 0x00008000 >> 16
  = 0;

And the other part of the observation was that it had to be done this
way (and couldn't be narrowed for vectorization) because 16 is larger
than the maximum shift for a short (actually you say that just below).

R.

>
> Unfortunately, in the gimple model of vector units, such large shift
> counts wouldn't be well-defined, so we won't vectorize such shifts any
> more, unless we can tell they're in range or undefined.
>
> So the test that expected the incorrect clamping we no longer perform
> needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
> suggested annotating the test with the expected ranges so as to enable
> the optimization.
>
>
> Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
>
> for  gcc/testsuite/ChangeLog
>
> 	PR tree-optimization/113281
> 	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
> ---
>  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> index 8c7adef9ed8f1..03078de49c65e 100644
> --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> @@ -9,6 +9,8 @@
>  void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
>    int i; \
>    for (i=0; i<NB; i++) { \
> +    if ((unsigned)b[i] >= (unsigned)(BITS)) \
> +      __builtin_unreachable(); \
>      dest[i] = a[i] OP b[i]; \
>    } \
>  }
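To make the promotion walkthrough above concrete, here is a minimal
standalone sketch (editorial, not from the thread).  It assumes GCC's
documented choices for two implementation-defined operations: an
out-of-range conversion to a signed type wraps modulo 2^N, and a right
shift of a negative value is arithmetic (sign-extending).

  /* Sketch of the int16_t promotion semantics discussed above;
     assumes GCC's wrapping conversion and arithmetic right shift.  */
  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    int16_t v = (int16_t) 32768;   /* wraps to -32768, i.e. 0x8000, on GCC */
    /* Default promotion: the shift is really (int) -32768 >> 16,
       i.e. 0xffff8000 >> 16 with sign bits shifting in.  */
    int correct = v >> 16;         /* -1 */
    /* The buggy simplification effectively dropped the sign extension
       of the intermediate value: */
    int buggy = 0x00008000 >> 16;  /* 0 */
    printf ("correct=%d buggy=%d\n", correct, buggy);
    return 0;
  }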
On Fri, 21 Jun 2024 at 12:14, Richard Earnshaw (lists)
<Richard.Earnshaw@arm.com> wrote:
>
> On 21/06/2024 08:57, Alexandre Oliva wrote:
> > On Jun 20, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> >
> >> Maybe using
> >>   if ((unsigned)b[i] >= BITS) \
> >> would be clearer?
> >
> > Heh.  Why make it simpler if we can make it unreadable, right? :-D
> >
> > Thanks, here's another version I've just retested on x-arm-eabi.  Ok?
> >
> > I'm not sure how to credit your suggestion.  It's not like you pretty
> > much wrote the entire patch, as in Richard's case, but it's still a
> > sizable chunk of this two-liner.  Any preferences?
>
> How about mentioning Christophe's simplification in the commit log?

For the avoidance of doubt: it's OK for me (but you don't need to
mention my name in fact ;-)

Thanks,

Christophe

> >
> > The test was too optimistic, alas.  We used to vectorize shifts
> > involving 8-bit and 16-bit integral types by clamping the shift count
> > at the highest in-range shift count, but that was not correct: such
> > narrow shifts expect integral promotion, so larger shift counts should
> > be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
> > before the fix).
>
> This is OK, but you might wish to revisit this statement before
> committing.  I think the above is a mis-summary of the original bug
> report, which had a test to pick between 0 and 1 as the result of a
> shift operation.
>
> If I've understood what's going on here correctly, then we have
>
>   (int16_t)32768 >> (int16_t)16
>
> but shift is always done at int precision, so this is (due to default
> promotions)
>
>   (int)(int16_t)32768 >> 16   // size/type of the shift amount does not matter
>
> which then simplifies to
>
>   -32768 >> 16;   // 0xffff8000 >> 16
>   = -1;
>
> I think the original bug was that we were losing the cast to short (and
> hence the sign extension of the intermediate value), so effectively we
> simplified this to
>
>   32768 >> 16;   // 0x00008000 >> 16
>   = 0;
>
> And the other part of the observation was that it had to be done this
> way (and couldn't be narrowed for vectorization) because 16 is larger
> than the maximum shift for a short (actually you say that just below).
>
> R.
>
> >
> > Unfortunately, in the gimple model of vector units, such large shift
> > counts wouldn't be well-defined, so we won't vectorize such shifts any
> > more, unless we can tell they're in range or undefined.
> >
> > So the test that expected the incorrect clamping we no longer perform
> > needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
> > suggested annotating the test with the expected ranges so as to enable
> > the optimization.
> >
> >
> > Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
> >
> > for  gcc/testsuite/ChangeLog
> >
> > 	PR tree-optimization/113281
> > 	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
> > ---
> >  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > index 8c7adef9ed8f1..03078de49c65e 100644
> > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > @@ -9,6 +9,8 @@
> >  void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> >    int i; \
> >    for (i=0; i<NB; i++) { \
> > +    if ((unsigned)b[i] >= (unsigned)(BITS)) \
> > +      __builtin_unreachable(); \
> >      dest[i] = a[i] OP b[i]; \
> >    } \
> >  }
On Jun 21, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:

>> How about mentioning Christophe's simplification in the commit log?

> For the avoidance of doubt: it's OK for me (but you don't need to
> mention my name in fact ;-)

Needing or not, I added it ;-)

>> > be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
>> > before the fix).

>> This is OK, but you might wish to revisit this statement before
>> committing.

Oh, right, sorry, I messed it up.  uint16_t was what I should have put
in there.  int16_t would have overflown and invoked undefined behavior
to begin with, and I see it misled you far down the wrong path.

>> I think the original bug was that we were losing the cast to short

The problem was that the shift count saturated at 15.  AFAIK sign
extension was not relevant.  Hopefully the rewritten opening paragraph
below makes that clearer.  I will put it in later this week barring
objections or further suggestions of improvement.  Thanks,


[testsuite] [arm] [vect] adjust mve-vshr test [PR113281]

The test was too optimistic, alas.  We used to vectorize shifts by
clamping the shift counts below the bit width of the types (e.g. at 15
for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
well defined (because of promotion to 32-bit int) and must yield 0,
not 1 (as before the fix).

Unfortunately, in the gimple model of vector units, such large shift
counts wouldn't be well-defined, so we won't vectorize such shifts any
more, unless we can tell they're in range or undefined.

So the test that expected the vectorization we no longer performed
needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
suggested annotating the test with the expected ranges so as to enable
the optimization, and Christophe Lyon suggested a further
simplification.


Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>

for  gcc/testsuite/ChangeLog

	PR tree-optimization/113281
	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
---
 gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
index 8c7adef9ed8f1..03078de49c65e 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
@@ -9,6 +9,8 @@
 void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
   int i; \
   for (i=0; i<NB; i++) { \
+    if ((unsigned)b[i] >= (unsigned)(BITS)) \
+      __builtin_unreachable(); \
     dest[i] = a[i] OP b[i]; \
   } \
 }
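The arithmetic in the corrected commit message can be checked directly.
The sketch below (editorial, not part of the patch) relies only on ISO C
promotion rules, since both operands are unsigned 16-bit values that
promote to a 32-bit int, and contrasts the well-defined result with what
the old clamping computed.

  /* Check of the commit message's example: (uint16_t)32768 >> 16 is
     well defined after promotion to int, and yields 0, not 1.  */
  #include <stdint.h>
  #include <assert.h>

  int main (void)
  {
    uint16_t a = 32768;                      /* 0x8000 */
    uint16_t n = 16;
    assert ((a >> n) == 0);                  /* promoted: (int)32768 >> 16 */
    assert ((a >> (n > 15 ? 15 : n)) == 1);  /* what the old clamping gave */
    return 0;
  }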
On 24/06/2024 12:35, Alexandre Oliva wrote:
> On Jun 21, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>
>>> How about mentioning Christophe's simplification in the commit log?
>
>> For the avoidance of doubt: it's OK for me (but you don't need to
>> mention my name in fact ;-)
>
> Needing or not, I added it ;-)
>
>>>> be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
>>>> before the fix).
>
>>> This is OK, but you might wish to revisit this statement before
>>> committing.
>
> Oh, right, sorry, I messed it up.  uint16_t was what I should have put
> in there.  int16_t would have overflown and invoked undefined behavior
> to begin with, and I see it misled you far down the wrong path.
>
>>> I think the original bug was that we were losing the cast to short
>
> The problem was that the shift count saturated at 15.  AFAIK sign
> extension was not relevant.  Hopefully the rewritten opening paragraph
> below makes that clearer.  I will put it in later this week barring
> objections or further suggestions of improvement.  Thanks,

A signed shift right on a 16-bit vector element by 15 would still yield
-1; but ...

>
> [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]
>
> The test was too optimistic, alas.  We used to vectorize shifts by
> clamping the shift counts below the bit width of the types (e.g. at 15
> for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
> well defined (because of promotion to 32-bit int) and must yield 0,
> not 1 (as before the fix).

That makes more sense now.  Thanks.

>
> Unfortunately, in the gimple model of vector units, such large shift
> counts wouldn't be well-defined, so we won't vectorize such shifts any
> more, unless we can tell they're in range or undefined.
>
> So the test that expected the vectorization we no longer performed
> needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
> suggested annotating the test with the expected ranges so as to enable
> the optimization, and Christophe Lyon suggested a further
> simplification.
>
>
> Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
>
> for  gcc/testsuite/ChangeLog
>
> 	PR tree-optimization/113281
> 	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.

I think this is OK now.

R.

> ---
>  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> index 8c7adef9ed8f1..03078de49c65e 100644
> --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> @@ -9,6 +9,8 @@
>  void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
>    int i; \
>    for (i=0; i<NB; i++) { \
> +    if ((unsigned)b[i] >= (unsigned)(BITS)) \
> +      __builtin_unreachable(); \
>      dest[i] = a[i] OP b[i]; \
>    } \
>  }
On Jun 24, 2024, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote: > A signed shift right on a 16-bit vector element by 15 would still > yield -1 Yeah. Indeed, ISTM that we *could* have retained the clamping transformation for *signed* shifts, since the clamping would only make a difference in case of (undefined) overflow. Only for unsigned shifts can well-defined shifts yield different results with clamping. Richard (Sandiford), do you happen to recall why the IRC conversation mentioned in the PR trail decided to drop it entirely, even for signed types?
Alexandre Oliva <oliva@adacore.com> writes:
> On Jun 24, 2024, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote:
>
>> A signed shift right on a 16-bit vector element by 15 would still
>> yield -1
>
> Yeah.  Indeed, ISTM that we *could* have retained the clamping
> transformation for *signed* shifts, since the clamping would only make a
> difference in case of (undefined) overflow.  Only for unsigned shifts
> can well-defined shifts yield different results with clamping.
>
> Richard (Sandiford), do you happen to recall why the IRC conversation
> mentioned in the PR trail decided to drop it entirely, even for signed
> types?

In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
well-defined for both signed and unsigned shifts, and no valid x' exists
for that case.

Thanks,
Richard
On Jun 25, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:

>> Richard (Sandiford), do you happen to recall why the IRC conversation
>> mentioned in the PR trail decided to drop it entirely, even for signed
>> types?

> In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
> vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
> well-defined for both signed and unsigned shifts, and no valid x' exists
> for that case.

It sounds like shifts on shorts proper, which would have benefitted from
the optimization, were not covered, and thus there may be room for
reconsidering, eh?
Alexandre Oliva <oliva@adacore.com> writes:
> On Jun 25, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:
>
>>> Richard (Sandiford), do you happen to recall why the IRC conversation
>>> mentioned in the PR trail decided to drop it entirely, even for signed
>>> types?
>
>> In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
>> vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
>> well-defined for both signed and unsigned shifts, and no valid x' exists
>> for that case.
>
> It sounds like shifts on shorts proper, which would have benefitted from
> the optimization, were not covered, and thus there may be room for
> reconsidering, eh?

What kind of case are you thinking of?  If a frontend creates a true
16-bit shift then it wouldn't need to be narrowed by this optimisation.

Like you say in the commit message, the problem here is that the values
are promoted to ints and then shifted as 32-bit values, so shifts
by >= 16 are well-defined.

Richard
On Jun 26, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:

> Alexandre Oliva <oliva@adacore.com> writes:
>> On Jun 25, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>
>>>> Richard (Sandiford), do you happen to recall why the IRC conversation
>>>> mentioned in the PR trail decided to drop it entirely, even for signed
>>>> types?
>>
>>> In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
>>> vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
>>> well-defined for both signed and unsigned shifts, and no valid x' exists
>>> for that case.
>>
>> It sounds like shifts on shorts proper, which would have benefitted from
>> the optimization, were not covered, and thus there may be room for
>> reconsidering, eh?

> What kind of case are you thinking of?  If a frontend creates a true
> 16-bit shift then it wouldn't need to be narrowed by this optimisation.

I'm thinking of *any* (looped over arrays) shifts of *signed* shorts.

The compiler can't generally tell that the shift count is < 16, as would
now be required to use the vector instructions, but that's not
necessary: for *signed* shorts, clamping the shift count at 15 like we
used to do is enough to get the correct result for well-defined (as in
non-overflowing) operations.

ISTM we've given up a useful optimization.
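Alexandre's equivalence claim for signed shorts can be brute-forced.
The following sketch (editorial, not from the thread) checks it over the
whole int16_t range, assuming GCC's arithmetic right shift for negative
values; counts of 32 and up are excluded since those are undefined even
after promotion to int.

  /* Verify that, for int16_t values, clamping the shift count at 15
     matches the promoted 32-bit shift for every count in [0,31].  */
  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    for (int32_t a = INT16_MIN; a <= INT16_MAX; a++)
      for (int b = 0; b < 32; b++)       /* counts >= 32 are UB on int */
        {
          int clamped = b > 15 ? 15 : b;
          if ((a >> b) != (a >> clamped))
            {
              printf ("mismatch: %d >> %d\n", (int) a, b);
              return 1;
            }
        }
    puts ("clamping at 15 is value-preserving for int16_t");
    return 0;
  }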
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
index 8c7adef9ed8f1..03078de49c65e 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
@@ -9,6 +9,8 @@
 void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
   int i; \
   for (i=0; i<NB; i++) { \
+    if ((unsigned)b[i] >= (unsigned)(BITS)) \
+      __builtin_unreachable(); \
     dest[i] = a[i] OP b[i]; \
   } \
 }
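For readers without the testsuite at hand, here is a hand expansion of
the annotated macro.  The instantiation parameters (NAME=vshr, OP=>>,
TYPE=int, SIGN=s, BITS=16, NB=8) are illustrative guesses, since the
actual macro invocations in mve-vshr.c are not quoted in this thread.

  /* Hypothetical expansion of the test macro for int16_t, 8 lanes.  */
  #include <stdint.h>

  void test_vshr_s16x8 (int16_t * __restrict__ dest, int16_t *a, int16_t *b)
  {
    int i;
    for (i = 0; i < 8; i++)
      {
        /* The added range annotation: promising the compiler that the
           shift count is always in [0, 15] lets the vectorizer emit the
           narrow MVE vector shifts safely.  */
        if ((unsigned) b[i] >= (unsigned) 16)
          __builtin_unreachable ();
        dest[i] = a[i] >> b[i];
      }
  }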