Message ID | ora5je918y.fsf_-_@lxoliva.fsfla.org
---|---
State | New
Series | [v3,testsuite,arm,vect] adjust mve-vshr test [PR113281]
On 21/06/2024 08:57, Alexandre Oliva wrote:
> On Jun 20, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>
>> Maybe using
>>   if ((unsigned)b[i] >= BITS) \
>> would be clearer?
>
> Heh.  Why make it simpler if we can make it unreadable, right? :-D
>
> Thanks, here's another version I've just retested on x-arm-eabi.  Ok?
>
> I'm not sure how to credit your suggestion.  It's not like you pretty
> much wrote the entire patch, as in Richard's case, but it's still a
> sizable chunk of this two-liner.  Any preferences?

How about mentioning Christophe's simplification in the commit log?

>
> The test was too optimistic, alas.  We used to vectorize shifts
> involving 8-bit and 16-bit integral types by clamping the shift count
> at the highest in-range shift count, but that was not correct: such
> narrow shifts expect integral promotion, so larger shift counts should
> be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
> before the fix).

This is OK, but you might wish to revisit this statement before
committing.  I think the above is a mis-summary of the original bug
report, which had a test to pick between 0 and 1 as the result of a
shift operation.

If I've understood what's going on here correctly, then we have

  (int16_t)32768 >> (int16_t)16

but shift is always done at int precision, so this is (due to default
promotions)

  (int)(int16_t)32768 >> 16   // size/type of the shift amount does not matter

which then simplifies to

  -32768 >> 16;   // 0xffff8000 >> 16
  = -1;

I think the original bug was that we were losing the cast to short (and
hence the sign extension of the intermediate value), so effectively we
simplified this to

  32768 >> 16;   // 0x00008000 >> 16
  = 0;

And the other part of the observation was that it had to be done this
way (and couldn't be narrowed for vectorization) because 16 is larger
than the maximum shift for a short (actually you say that just below).

R.

>
> Unfortunately, in the gimple model of vector units, such large shift
> counts wouldn't be well-defined, so we won't vectorize such shifts any
> more, unless we can tell they're in range or undefined.
>
> So the test that expected the incorrect clamping we no longer perform
> needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
> suggested annotating the test with the expected ranges so as to enable
> the optimization.
>
>
> Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
>
> for  gcc/testsuite/ChangeLog
>
> 	PR tree-optimization/113281
> 	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
> ---
>  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> index 8c7adef9ed8f1..03078de49c65e 100644
> --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> @@ -9,6 +9,8 @@
>  void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
>    int i; \
>    for (i=0; i<NB; i++) { \
> +    if ((unsigned)b[i] >= (unsigned)(BITS)) \
> +      __builtin_unreachable(); \
>      dest[i] = a[i] OP b[i]; \
>    } \
>  }
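To make the promotion walkthrough above concrete, here is a minimal
standalone sketch (editorial, not from the thread).  It assumes GCC's
documented choices for two implementation-defined operations: an
out-of-range conversion to a signed type wraps modulo 2^N, and a right
shift of a negative value is arithmetic (sign-extending).

  /* Sketch of the int16_t promotion semantics discussed above;
     assumes GCC's wrapping conversion and arithmetic right shift.  */
  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    int16_t v = (int16_t) 32768;   /* wraps to -32768, i.e. 0x8000, on GCC */
    /* Default promotion: the shift is really (int) -32768 >> 16,
       i.e. 0xffff8000 >> 16 with sign bits shifting in.  */
    int correct = v >> 16;         /* -1 */
    /* The buggy simplification effectively dropped the sign extension
       of the intermediate value: */
    int buggy = 0x00008000 >> 16;  /* 0 */
    printf ("correct=%d buggy=%d\n", correct, buggy);
    return 0;
  }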
On Fri, 21 Jun 2024 at 12:14, Richard Earnshaw (lists)
<Richard.Earnshaw@arm.com> wrote:
>
> On 21/06/2024 08:57, Alexandre Oliva wrote:
> > On Jun 20, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> >
> >> Maybe using
> >>   if ((unsigned)b[i] >= BITS) \
> >> would be clearer?
> >
> > Heh.  Why make it simpler if we can make it unreadable, right? :-D
> >
> > Thanks, here's another version I've just retested on x-arm-eabi.  Ok?
> >
> > I'm not sure how to credit your suggestion.  It's not like you pretty
> > much wrote the entire patch, as in Richard's case, but it's still a
> > sizable chunk of this two-liner.  Any preferences?
>
> How about mentioning Christophe's simplification in the commit log?

For the avoidance of doubt: it's OK for me (but you don't need to
mention my name in fact ;-)

Thanks,

Christophe

> >
> > The test was too optimistic, alas.  We used to vectorize shifts
> > involving 8-bit and 16-bit integral types by clamping the shift count
> > at the highest in-range shift count, but that was not correct: such
> > narrow shifts expect integral promotion, so larger shift counts should
> > be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
> > before the fix).
>
> This is OK, but you might wish to revisit this statement before
> committing.  I think the above is a mis-summary of the original bug
> report, which had a test to pick between 0 and 1 as the result of a
> shift operation.
>
> If I've understood what's going on here correctly, then we have
>
>   (int16_t)32768 >> (int16_t)16
>
> but shift is always done at int precision, so this is (due to default
> promotions)
>
>   (int)(int16_t)32768 >> 16   // size/type of the shift amount does not matter
>
> which then simplifies to
>
>   -32768 >> 16;   // 0xffff8000 >> 16
>   = -1;
>
> I think the original bug was that we were losing the cast to short (and
> hence the sign extension of the intermediate value), so effectively we
> simplified this to
>
>   32768 >> 16;   // 0x00008000 >> 16
>   = 0;
>
> And the other part of the observation was that it had to be done this
> way (and couldn't be narrowed for vectorization) because 16 is larger
> than the maximum shift for a short (actually you say that just below).
>
> R.
>
> >
> > Unfortunately, in the gimple model of vector units, such large shift
> > counts wouldn't be well-defined, so we won't vectorize such shifts any
> > more, unless we can tell they're in range or undefined.
> >
> > So the test that expected the incorrect clamping we no longer perform
> > needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
> > suggested annotating the test with the expected ranges so as to enable
> > the optimization.
> >
> >
> > Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
> >
> > for  gcc/testsuite/ChangeLog
> >
> > 	PR tree-optimization/113281
> > 	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
> > ---
> >  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > index 8c7adef9ed8f1..03078de49c65e 100644
> > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > @@ -9,6 +9,8 @@
> >  void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
> >    int i; \
> >    for (i=0; i<NB; i++) { \
> > +    if ((unsigned)b[i] >= (unsigned)(BITS)) \
> > +      __builtin_unreachable(); \
> >      dest[i] = a[i] OP b[i]; \
> >    } \
> >  }
On Jun 21, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:

>> How about mentioning Christophe's simplification in the commit log?

> For the avoidance of doubt: it's OK for me (but you don't need to
> mention my name in fact ;-)

Needing or not, I added it ;-)

>> > be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
>> > before the fix).

>> This is OK, but you might wish to revisit this statement before
>> committing.

Oh, right, sorry, I messed it up.  uint16_t was what I should have put
in there.  int16_t would have overflown and invoked undefined behavior
to begin with, and I see it misled you far down the wrong path.

>> I think the original bug was that we were losing the cast to short

The problem was that the shift count saturated at 15.  AFAIK sign
extension was not relevant.  Hopefully the rewritten opening paragraph
below makes that clearer.  I will put it in later this week barring
objections or further suggestions of improvement.  Thanks,


[testsuite] [arm] [vect] adjust mve-vshr test [PR113281]

The test was too optimistic, alas.  We used to vectorize shifts by
clamping the shift counts below the bit width of the types (e.g. at 15
for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
well defined (because of promotion to 32-bit int) and must yield 0,
not 1 (as before the fix).

Unfortunately, in the gimple model of vector units, such large shift
counts wouldn't be well-defined, so we won't vectorize such shifts any
more, unless we can tell they're in range or undefined.

So the test that expected the vectorization we no longer performed
needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
suggested annotating the test with the expected ranges so as to enable
the optimization, and Christophe Lyon suggested a further
simplification.


Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>

for  gcc/testsuite/ChangeLog

	PR tree-optimization/113281
	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
---
 gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
index 8c7adef9ed8f1..03078de49c65e 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
@@ -9,6 +9,8 @@
 void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
   int i; \
   for (i=0; i<NB; i++) { \
+    if ((unsigned)b[i] >= (unsigned)(BITS)) \
+      __builtin_unreachable(); \
     dest[i] = a[i] OP b[i]; \
   } \
 }
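The arithmetic in the corrected commit message can be checked directly.
The sketch below (editorial, not part of the patch) relies only on ISO C
promotion rules, since both operands are unsigned 16-bit values that
promote to a 32-bit int, and contrasts the well-defined result with what
the old clamping computed.

  /* Check of the commit message's example: (uint16_t)32768 >> 16 is
     well defined after promotion to int, and yields 0, not 1.  */
  #include <stdint.h>
  #include <assert.h>

  int main (void)
  {
    uint16_t a = 32768;                      /* 0x8000 */
    uint16_t n = 16;
    assert ((a >> n) == 0);                  /* promoted: (int)32768 >> 16 */
    assert ((a >> (n > 15 ? 15 : n)) == 1);  /* what the old clamping gave */
    return 0;
  }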
On 24/06/2024 12:35, Alexandre Oliva wrote:
> On Jun 21, 2024, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>
>>> How about mentioning Christophe's simplification in the commit log?
>
>> For the avoidance of doubt: it's OK for me (but you don't need to
>> mention my name in fact ;-)
>
> Needing or not, I added it ;-)
>
>>>> be accepted.  (int16_t)32768 >> (int16_t)16 must yield 0, not 1 (as
>>>> before the fix).
>
>>> This is OK, but you might wish to revisit this statement before
>>> committing.
>
> Oh, right, sorry, I messed it up.  uint16_t was what I should have put
> in there.  int16_t would have overflown and invoked undefined behavior
> to begin with, and I see it misled you far down the wrong path.
>
>>> I think the original bug was that we were losing the cast to short
>
> The problem was that the shift count saturated at 15.  AFAIK sign
> extension was not relevant.  Hopefully the rewritten opening paragraph
> below makes that clearer.  I will put it in later this week barring
> objections or further suggestions of improvement.  Thanks,

A signed shift right on a 16-bit vector element by 15 would still yield
-1; but ...

>
> [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]
>
> The test was too optimistic, alas.  We used to vectorize shifts by
> clamping the shift counts below the bit width of the types (e.g. at 15
> for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
> well defined (because of promotion to 32-bit int) and must yield 0,
> not 1 (as before the fix).

That makes more sense now.  Thanks.

>
> Unfortunately, in the gimple model of vector units, such large shift
> counts wouldn't be well-defined, so we won't vectorize such shifts any
> more, unless we can tell they're in range or undefined.
>
> So the test that expected the vectorization we no longer performed
> needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
> suggested annotating the test with the expected ranges so as to enable
> the optimization, and Christophe Lyon suggested a further
> simplification.
>
>
> Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
>
> for  gcc/testsuite/ChangeLog
>
> 	PR tree-optimization/113281
> 	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.

I think this is OK now.

R.

> ---
>  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> index 8c7adef9ed8f1..03078de49c65e 100644
> --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> @@ -9,6 +9,8 @@
>  void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
>    int i; \
>    for (i=0; i<NB; i++) { \
> +    if ((unsigned)b[i] >= (unsigned)(BITS)) \
> +      __builtin_unreachable(); \
>      dest[i] = a[i] OP b[i]; \
>    } \
>  }
On Jun 24, 2024, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote: > A signed shift right on a 16-bit vector element by 15 would still > yield -1 Yeah. Indeed, ISTM that we *could* have retained the clamping transformation for *signed* shifts, since the clamping would only make a difference in case of (undefined) overflow. Only for unsigned shifts can well-defined shifts yield different results with clamping. Richard (Sandiford), do you happen to recall why the IRC conversation mentioned in the PR trail decided to drop it entirely, even for signed types?
Alexandre Oliva <oliva@adacore.com> writes:
> On Jun 24, 2024, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote:
>
>> A signed shift right on a 16-bit vector element by 15 would still
>> yield -1
>
> Yeah.  Indeed, ISTM that we *could* have retained the clamping
> transformation for *signed* shifts, since the clamping would only make a
> difference in case of (undefined) overflow.  Only for unsigned shifts
> can well-defined shifts yield different results with clamping.
>
> Richard (Sandiford), do you happen to recall why the IRC conversation
> mentioned in the PR trail decided to drop it entirely, even for signed
> types?

In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
well-defined for both signed and unsigned shifts, and no valid x' exists
for that case.

Thanks,
Richard
On Jun 25, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:

>> Richard (Sandiford), do you happen to recall why the IRC conversation
>> mentioned in the PR trail decided to drop it entirely, even for signed
>> types?

> In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
> vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
> well-defined for both signed and unsigned shifts, and no valid x' exists
> for that case.

It sounds like shifts on shorts proper, which would have benefitted from
the optimization, were not covered, and thus there may be room for
reconsidering, eh?
Alexandre Oliva <oliva@adacore.com> writes:
> On Jun 25, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:
>
>>> Richard (Sandiford), do you happen to recall why the IRC conversation
>>> mentioned in the PR trail decided to drop it entirely, even for signed
>>> types?
>
>> In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
>> vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
>> well-defined for both signed and unsigned shifts, and no valid x' exists
>> for that case.
>
> It sounds like shifts on shorts proper, which would have benefitted from
> the optimization, were not covered, and thus there may be room for
> reconsidering, eh?

What kind of case are you thinking of?  If a frontend creates a true
16-bit shift then it wouldn't need to be narrowed by this optimisation.

Like you say in the commit message, the problem here is that the values
are promoted to ints and then shifted as 32-bit values, so shifts
by >= 16 are well-defined.

Richard
On Jun 26, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:

> Alexandre Oliva <oliva@adacore.com> writes:
>> On Jun 25, 2024, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>
>>>> Richard (Sandiford), do you happen to recall why the IRC conversation
>>>> mentioned in the PR trail decided to drop it entirely, even for signed
>>>> types?
>>
>>> In the PR, the original shift was 32768 >> x (x >= 16) on ints, which the
>>> vectoriser was narrowing to 32768 >> x' on shorts.  The original shift is
>>> well-defined for both signed and unsigned shifts, and no valid x' exists
>>> for that case.
>>
>> It sounds like shifts on shorts proper, which would have benefitted from
>> the optimization, were not covered, and thus there may be room for
>> reconsidering, eh?

> What kind of case are you thinking of?  If a frontend creates a true
> 16-bit shift then it wouldn't need to be narrowed by this optimisation.

I'm thinking of *any* (looped over arrays) shifts of *signed* shorts.

The compiler can't generally tell that the shift count is < 16, as would
now be required to use the vector instructions, but that's not
necessary: for *signed* shorts, clamping the shift count at 15 like we
used to do is enough to get the correct result for well-defined (as in
non-overflowing) operations.

ISTM we've given up a useful optimization.
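Alexandre's equivalence claim for signed shorts can be brute-forced.
The following sketch (editorial, not from the thread) checks it over the
whole int16_t range, assuming GCC's arithmetic right shift for negative
values; counts of 32 and up are excluded since those are undefined even
after promotion to int.

  /* Verify that, for int16_t values, clamping the shift count at 15
     matches the promoted 32-bit shift for every count in [0,31].  */
  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    for (int32_t a = INT16_MIN; a <= INT16_MAX; a++)
      for (int b = 0; b < 32; b++)       /* counts >= 32 are UB on int */
        {
          int clamped = b > 15 ? 15 : b;
          if ((a >> b) != (a >> clamped))
            {
              printf ("mismatch: %d >> %d\n", (int) a, b);
              return 1;
            }
        }
    puts ("clamping at 15 is value-preserving for int16_t");
    return 0;
  }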
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
index 8c7adef9ed8f1..03078de49c65e 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
@@ -9,6 +9,8 @@
 void test_ ## NAME ##_ ## SIGN ## BITS ## x ## NB (TYPE##BITS##_t * __restrict__ dest, TYPE##BITS##_t *a, TYPE##BITS##_t *b) { \
   int i; \
   for (i=0; i<NB; i++) { \
+    if ((unsigned)b[i] >= (unsigned)(BITS)) \
+      __builtin_unreachable(); \
     dest[i] = a[i] OP b[i]; \
   } \
 }
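For readers without the testsuite at hand, here is a hand expansion of
the annotated macro.  The instantiation parameters (NAME=vshr, OP=>>,
TYPE=int, SIGN=s, BITS=16, NB=8) are illustrative guesses, since the
actual macro invocations in mve-vshr.c are not quoted in this thread.

  /* Hypothetical expansion of the test macro for int16_t, 8 lanes.  */
  #include <stdint.h>

  void test_vshr_s16x8 (int16_t * __restrict__ dest, int16_t *a, int16_t *b)
  {
    int i;
    for (i = 0; i < 8; i++)
      {
        /* The added range annotation: promising the compiler that the
           shift count is always in [0, 15] lets the vectorizer emit the
           narrow MVE vector shifts safely.  */
        if ((unsigned) b[i] >= (unsigned) 16)
          __builtin_unreachable ();
        dest[i] = a[i] >> b[i];
      }
  }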