diff mbox series

rs6000: Vector shift-right should honor modulo semantics

Message ID f1f5ac18-6698-303d-fa58-95e77453a4c2@linux.ibm.com
State New
Headers show
Series rs6000: Vector shift-right should honor modulo semantics | expand

Commit Message

Bill Schmidt Feb. 10, 2019, 4:10 p.m. UTC
Hi!

We had a problem report for code attempting to implement a vector right-shift for a
vector long long (V2DImode) type.  The programmer noted that we don't have a vector
splat-immediate for this mode, but cleverly realized he could use a vector char
splat-immediate since only the lower 6 bits of the shift count are read.  This is a
documented feature of both the vector shift built-ins and the underlying instruction.

Starting with GCC 8, the vector shifts are expanded early in rs6000_gimple_fold_builtin.
However, the GIMPLE folding does not currently perform the required TRUNC_MOD_EXPR to
implement the built-in semantics.  It appears that this was caught earlier for vector
shift-left and fixed, but the same problem was not fixed for vector shift-right.
This patch fixes that.

While fixing that problem, I noted that we get inferior code generation when we try
to fold the vector char splat earlier, due to the type mismatch and some additional
optimizations performed in the middle end.  Because this is a rare circumstance, it
makes sense to avoid the GIMPLE folding in this case and allow the back end to do
the expansion; this produces the clean code we're expecting with the vspltisb intact.

I've added executable tests for both shift-right algebraic and shift-right logical.
Both fail prior to applying the patch, and work correctly afterwards.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.  Is
this okay for trunk, and for GCC 8.3 if there is no fallout by the end of the
week?

Thanks,
Bill


[gcc]

2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>

	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
	for correct semantics.  Also, don't expand a vector-splat if there
	is a type mismatch; let the back end handle it.

[gcc/testsuite]

2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>

	* gcc.target/powerpc/srad-modulo.c: New.
	* gcc.target/powerpc/srd-modulo.c: New.

Comments

Segher Boessenkool Feb. 10, 2019, 10:05 p.m. UTC | #1
Hi Bill,

On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
> I've added executable tests for both shift-right algebraic and shift-right logical.
> Both fail prior to applying the patch, and work correctly afterwards.

Please add a test for left shifts, as well?

> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
> 	for correct semantics.  Also, don't expand a vector-splat if there
> 	is a type mismatch; let the back end handle it.

Does it always result in just the shift instruction now?  Does the modulo
ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
modulo by a power of two, so a simple bitmask, so maybe write that directly
instead?

> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* gcc.target/powerpc/srad-modulo.c: New.
> 	* gcc.target/powerpc/srd-modulo.c: New.

Please put "vec-" in the testcase name.  You may need to rename vec-shift.c
and/or vec-shr.c, which are not as generic as their names suggest.

> @@ -16072,6 +16116,13 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *
>  	arg0 = gimple_call_arg (stmt, 0);
>  	lhs = gimple_call_lhs (stmt);
>  
> +	/* It's sometimes useful to use one of these to build a
> +	   splat for V2DImode, since the upper bits will be ignored.
> +	   Don't fold if we detect that situation, as we end up
> +	   losing the splat instruction in this case.  */
> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
> +	  return false;

This isn't really detecting that situation...  Instead, it doesn't fold
whenever the size of the splatted elements isn't the same as the size of
the elts of the target vector.  That's probably perfectly safe, but please
spell it out.  It's fine to mention the motivating case, of course.

> Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
> @@ -0,0 +1,43 @@
> +/* Test that using a character splat to set up a shift-right algebraic
> +   for a doubleword vector works correctly after gimple folding.  */
> +
> +/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
> +/* { dg-options { "-O3" } } */

powerpc64*-*-* is almost always wrong.  I don't think you need to limit
to 64-bit at all here, but if you do, test for lp64 instead.

Testing for vsx_hw but not enabling vsx is probably wrong, too.

Does it need -O3, does -O2 not work?

Should this testcase check expected machine code as well?

> --- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)

> +vui64_t
> +test_sradi_4 (vui64_t a)
> +{
> +  return vec_sradi (a, 4);
> +}

Pasto?  (srad vs. srd).


Segher
Bill Schmidt Feb. 11, 2019, 2:42 a.m. UTC | #2
On 2/10/19 4:05 PM, Segher Boessenkool wrote:
> Hi Bill,
>
> On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
>> I've added executable tests for both shift-right algebraic and shift-right logical.
>> Both fail prior to applying the patch, and work correctly afterwards.
> Please add a test for left shifts, as well?

Can do.  I verified that left shifts were not broken and figured a test case
had been added then, but have not checked.  Good to test this particular
scenario, though.

>
>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>
>> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
>> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
>> 	for correct semantics.  Also, don't expand a vector-splat if there
>> 	is a type mismatch; let the back end handle it.
> Does it always result in just the shift instruction now?  Does the modulo
> ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
> modulo by a power of two, so a simple bitmask, so maybe write that directly
> instead?

We always get the shift.  For -mcpu=power8, we always load the mask from
memory rather than generating the vspltisb, which is not ideal code generation,
but is at least correct.

For -mcpu=power9, we get close, but have some bad register allocation and
an unnecessary extend:

        xxspltib 0,4   <- why not just xxspltib 32,4?
        xxlor 32,0,0   <- wasted copy
        vextsb2d 0,0   <- unnecessary due to vsrad semantics
        vsrad 2,2,0

Again, this is at least correct.  We have more work to do to produce the
most efficient code, but we have PR89213 open for that.

>
>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>
>> 	* gcc.target/powerpc/srad-modulo.c: New.
>> 	* gcc.target/powerpc/srd-modulo.c: New.
> Please put "vec-" in the testcase name.  You may need to rename vec-shift.c
> and/or vec-shr.c, which are not as generic as their names suggest.

Ok.

>
>> @@ -16072,6 +16116,13 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *
>>  	arg0 = gimple_call_arg (stmt, 0);
>>  	lhs = gimple_call_lhs (stmt);
>>  
>> +	/* It's sometimes useful to use one of these to build a
>> +	   splat for V2DImode, since the upper bits will be ignored.
>> +	   Don't fold if we detect that situation, as we end up
>> +	   losing the splat instruction in this case.  */
>> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
>> +	  return false;
> This isn't really detecting that situation...  Instead, it doesn't fold
> whenever the size of the splatted elements isn't the same as the size of
> the elts of the target vector.  That's probably perfectly safe, but please
> spell it out.  It's fine to mention the motivating case, of course.

Yep, will correct.  Actually, as I look back at my notes, I believe that this
change is not necessary after all (same code generated with and without it).
I'll verify.

>
>> Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
>> ===================================================================
>> --- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
>> +++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
>> @@ -0,0 +1,43 @@
>> +/* Test that using a character splat to set up a shift-right algebraic
>> +   for a doubleword vector works correctly after gimple folding.  */
>> +
>> +/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
>> +/* { dg-options { "-O3" } } */
> powerpc64*-*-* is almost always wrong.  I don't think you need to limit
> to 64-bit at all here, but if you do, test for lp64 instead.

Ok.

>
> Testing for vsx_hw but not enabling vsx is probably wrong, too.

Weird.  I just tried adding -mvsx and I get this peculiar error we've seen
before about AMD graphics card offloading:

spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
./srad-modulo.exe^M
^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
compiler exited with status 1
Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
compilation terminated.^M
compiler exited with status 1
FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
Excess errors:
^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
m^[[K' or '^[[01m^[[Kfast^[[m^[[K'

Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
Do we still have a bug open on this?

>
> Does it need -O3, does -O2 not work?

-O2 is fine, can change to that.

>
> Should this testcase check expected machine code as well?

I think we should defer that to the fix for PR89213.  We aren't at
ideal code quite yet.

>
>> --- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
>> +++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)
>> +vui64_t
>> +test_sradi_4 (vui64_t a)
>> +{
>> +  return vec_sradi (a, 4);
>> +}
> Pasto?  (srad vs. srd).

Ecch.  Will fix.  

>
>
> Segher
>
Bill Schmidt Feb. 11, 2019, 2:56 a.m. UTC | #3
On 2/10/19 8:42 PM, Bill Schmidt wrote:
> On 2/10/19 4:05 PM, Segher Boessenkool wrote:
>> Hi Bill,
>>
>> On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
>>> I've added executable tests for both shift-right algebraic and shift-right logical.
>>> Both fail prior to applying the patch, and work correctly afterwards.
>> Please add a test for left shifts, as well?
> Can do.  I verified that left shifts were not broken and figured a test case
> had been added then, but have not checked.  Good to test this particular
> scenario, though.
>
>>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>>
>>> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
>>> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
>>> 	for correct semantics.  Also, don't expand a vector-splat if there
>>> 	is a type mismatch; let the back end handle it.
>> Does it always result in just the shift instruction now?  Does the modulo
>> ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
>> modulo by a power of two, so a simple bitmask, so maybe write that directly
>> instead?
> We always get the shift.  For -mcpu=power8, we always load the mask from
> memory rather than generating the vspltisb, which is not ideal code generation,
> but is at least correct.
>
> For -mcpu=power9, we get close, but have some bad register allocation and
> an unnecessary extend:
>
>         xxspltib 0,4   <- why not just xxspltib 32,4?
>         xxlor 32,0,0   <- wasted copy
>         vextsb2d 0,0   <- unnecessary due to vsrad semantics
>         vsrad 2,2,0
>
> Again, this is at least correct.  We have more work to do to produce the
> most efficient code, but we have PR89213 open for that.
>
>>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>>
>>> 	* gcc.target/powerpc/srad-modulo.c: New.
>>> 	* gcc.target/powerpc/srd-modulo.c: New.
>> Please put "vec-" in the testcase name.  You may need to rename vec-shift.c
>> and/or vec-shr.c, which are not as generic as their names suggest.
> Ok.
>
>>> @@ -16072,6 +16116,13 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *
>>>  	arg0 = gimple_call_arg (stmt, 0);
>>>  	lhs = gimple_call_lhs (stmt);
>>>  
>>> +	/* It's sometimes useful to use one of these to build a
>>> +	   splat for V2DImode, since the upper bits will be ignored.
>>> +	   Don't fold if we detect that situation, as we end up
>>> +	   losing the splat instruction in this case.  */
>>> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
>>> +	  return false;
>> This isn't really detecting that situation...  Instead, it doesn't fold
>> whenever the size of the splatted elements isn't the same as the size of
>> the elts of the target vector.  That's probably perfectly safe, but please
>> spell it out.  It's fine to mention the motivating case, of course.
> Yep, will correct.  Actually, as I look back at my notes, I believe that this
> change is not necessary after all (same code generated with and without it).
> I'll verify.
>
>>> Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
>>> ===================================================================
>>> --- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
>>> +++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
>>> @@ -0,0 +1,43 @@
>>> +/* Test that using a character splat to set up a shift-right algebraic
>>> +   for a doubleword vector works correctly after gimple folding.  */
>>> +
>>> +/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
>>> +/* { dg-options { "-O3" } } */
>> powerpc64*-*-* is almost always wrong.  I don't think you need to limit
>> to 64-bit at all here, but if you do, test for lp64 instead.
> Ok.
>
>> Testing for vsx_hw but not enabling vsx is probably wrong, too.
> Weird.  I just tried adding -mvsx and I get this peculiar error we've seen
> before about AMD graphics card offloading:
>
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
> est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
> ./srad-modulo.exe^M
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
> compiler exited with status 1
> Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
> cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
> s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
> xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
> compilation terminated.^M
> compiler exited with status 1
> FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
> Excess errors:
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'
>
> Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
> Do we still have a bug open on this?

Looks like this was PR88920, and apparently this silly offload thing pops up 
once per testsuite run if there's a failure of any kind.  I don't consider just
caching it so it only happens once to be an adequate fix... this shouldn't be
happening at all and is highly misleading.

The actual problem is that I've got an extra set of braces in the dg-options
directive... oops...

Thanks,
Bill


>
>> Does it need -O3, does -O2 not work?
> -O2 is fine, can change to that.
>
>> Should this testcase check expected machine code as well?
> I think we should defer that to the fix for PR89213.  We aren't at
> ideal code quite yet.
>
>>> --- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
>>> +++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)
>>> +vui64_t
>>> +test_sradi_4 (vui64_t a)
>>> +{
>>> +  return vec_sradi (a, 4);
>>> +}
>> Pasto?  (srad vs. srd).
> Ecch.  Will fix.  
>
>>
>> Segher
>>
Segher Boessenkool Feb. 11, 2019, 12:43 p.m. UTC | #4
On Sun, Feb 10, 2019 at 08:42:42PM -0600, Bill Schmidt wrote:
> On 2/10/19 4:05 PM, Segher Boessenkool wrote:
> > On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
> >> I've added executable tests for both shift-right algebraic and shift-right logical.
> >> Both fail prior to applying the patch, and work correctly afterwards.
> > Please add a test for left shifts, as well?
> 
> Can do.  I verified that left shifts were not broken and figured a test case
> had been added then, but have not checked.  Good to test this particular
> scenario, though.

Well you actually added code for left shifts, too ;-)

> >> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> >>
> >> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
> >> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
> >> 	for correct semantics.  Also, don't expand a vector-splat if there
> >> 	is a type mismatch; let the back end handle it.
> > Does it always result in just the shift instruction now?  Does the modulo
> > ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
> > modulo by a power of two, so a simple bitmask, so maybe write that directly
> > instead?
> 
> We always get the shift.  For -mcpu=power8, we always load the mask from
> memory rather than generating the vspltisb, which is not ideal code generation,
> but is at least correct.

_Just_ the shift, I meant.  I know we load a constant from memory, for
constants; I am worried about non-constants, do those get a modulo, or
just a mask.  Also at -O0; if we get an actual division at -O0, there
probably are other cases where it isn't optimised away as well.

> For -mcpu=power9, we get close, but have some bad register allocation and
> an unnecessary extend:
> 
>         xxspltib 0,4   <- why not just xxspltib 32,4?
>         xxlor 32,0,0   <- wasted copy

Yeah, huh.  Where does that come from...  I blame splitters after reload.

>         vextsb2d 0,0   <- unnecessary due to vsrad semantics
>         vsrad 2,2,0
> 
> Again, this is at least correct.  We have more work to do to produce the
> most efficient code, but we have PR89213 open for that.

Yes.  But I'd like to see at least a mask instead of a modulo.

> >> +	/* It's sometimes useful to use one of these to build a
> >> +	   splat for V2DImode, since the upper bits will be ignored.
> >> +	   Don't fold if we detect that situation, as we end up
> >> +	   losing the splat instruction in this case.  */
> >> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
> >> +	  return false;
> > This isn't really detecting that situation...  Instead, it doesn't fold
> > whenever the size of the splatted elements isn't the same as the size of
> > the elts of the target vector.  That's probably perfectly safe, but please
> > spell it out.  It's fine to mention the motivating case, of course.
> 
> Yep, will correct.  Actually, as I look back at my notes, I believe that this
> change is not necessary after all (same code generated with and without it).
> I'll verify.

Ha, that would be nice :-)

> > Testing for vsx_hw but not enabling vsx is probably wrong, too.
> 
> Weird.  I just tried adding -mvsx

Does it _need_ VSX anyway?  Are these builtins defined without it, too?

> and I get this peculiar error we've seen
> before about AMD graphics card offloading:
> 
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
> est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
> ./srad-modulo.exe^M
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
> compiler exited with status 1

Huh, why do you get colour escapes at all?  I thought the testsuite used
GCC_COLORS= to get rid of that nonsense.  It's quite unreadable like this.

> Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
> cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
> s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
> xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
> compilation terminated.^M
> compiler exited with status 1

This is a test to see if there is offload stuff.  There isn't.

> FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
> Excess errors:
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'
> 
> Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
> Do we still have a bug open on this?

I have never seen this before, not in bugzilla, not in my own testing (I do
however almost always use --disable-libgomp, it breaks way too often,
impeding all other progress).

> > Does it need -O3, does -O2 not work?
> 
> -O2 is fine, can change to that.

Please do.

> > Should this testcase check expected machine code as well?
> 
> I think we should defer that to the fix for PR89213.  We aren't at
> ideal code quite yet.

Gotcha.


Segher
Bill Schmidt Feb. 11, 2019, 1:17 p.m. UTC | #5
On 2/11/19 6:43 AM, Segher Boessenkool wrote:
> On Sun, Feb 10, 2019 at 08:42:42PM -0600, Bill Schmidt wrote:
>> On 2/10/19 4:05 PM, Segher Boessenkool wrote:
>>> On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
>>>> I've added executable tests for both shift-right algebraic and shift-right logical.
>>>> Both fail prior to applying the patch, and work correctly afterwards.
>>> Please add a test for left shifts, as well?
>> Can do.  I verified that left shifts were not broken and figured a test case
>> had been added then, but have not checked.  Good to test this particular
>> scenario, though.
> Well you actually added code for left shifts, too ;-)

Nope. :-)  It just looks like it because of the way the diff hunks come out.
The changes are to shift-right-logical and shift-right-algebraic.  But it
still doesn't hurt to add the test.

>
>>>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>>>
>>>> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
>>>> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
>>>> 	for correct semantics.  Also, don't expand a vector-splat if there
>>>> 	is a type mismatch; let the back end handle it.
>>> Does it always result in just the shift instruction now?  Does the modulo
>>> ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
>>> modulo by a power of two, so a simple bitmask, so maybe write that directly
>>> instead?
>> We always get the shift.  For -mcpu=power8, we always load the mask from
>> memory rather than generating the vspltisb, which is not ideal code generation,
>> but is at least correct.
> _Just_ the shift, I meant.  I know we load a constant from memory, for
> constants; I am worried about non-constants, do those get a modulo, or
> just a mask.  Also at -O0; if we get an actual division at -O0, there
> probably are other cases where it isn't optimised away as well.

Even at -O1 this is all folded away in GIMPLE:

test_sradi_4 (vi64_t a)
{
  vi64_t result;

  <bb 2> [local count: 1073741824]:
  result_3 = a_2(D) >> 4;
  return result_3;

}

At -O0 (if I hand-inline everything myself to avoid errors), we scalarize
the modulo/masking operation into a rldicl for each doubleword.  I really
don't see any reason to change the code.

>
>> For -mcpu=power9, we get close, but have some bad register allocation and
>> an unnecessary extend:
>>
>>         xxspltib 0,4   <- why not just xxspltib 32,4?
>>         xxlor 32,0,0   <- wasted copy
> Yeah, huh.  Where does that come from...  I blame splitters after reload.

This only happens at -O2 and up, FWIW.  At -O1 we allocate the registers
reasonably.

>
>>         vextsb2d 0,0   <- unnecessary due to vsrad semantics
>>         vsrad 2,2,0
>>
>> Again, this is at least correct.  We have more work to do to produce the
>> most efficient code, but we have PR89213 open for that.
> Yes.  But I'd like to see at least a mask instead of a modulo.

Again, there is no modulo by the time we exit GIMPLE.  Our efficiency
failings are all in the back end.

>
>>>> +	/* It's sometimes useful to use one of these to build a
>>>> +	   splat for V2DImode, since the upper bits will be ignored.
>>>> +	   Don't fold if we detect that situation, as we end up
>>>> +	   losing the splat instruction in this case.  */
>>>> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
>>>> +	  return false;
>>> This isn't really detecting that situation...  Instead, it doesn't fold
>>> whenever the size of the splatted elements isn't the same as the size of
>>> the elts of the target vector.  That's probably perfectly safe, but please
>>> spell it out.  It's fine to mention the motivating case, of course.
>> Yep, will correct.  Actually, as I look back at my notes, I believe that this
>> change is not necessary after all (same code generated with and without it).
>> I'll verify.
> Ha, that would be nice :-)

Confirmed, BTW.

>
>>> Testing for vsx_hw but not enabling vsx is probably wrong, too.
>> Weird.  I just tried adding -mvsx
> Does it _need_ VSX anyway?  Are these builtins defined without it, too?

Yes (vector long long / V2DImode requires VSX).

>
>> and I get this peculiar error we've seen
>> before about AMD graphics card offloading:
>>
>> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
>> est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
>> ./srad-modulo.exe^M
>> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
>> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
>> compiler exited with status 1
> Huh, why do you get colour escapes at all?  I thought the testsuite used
> GCC_COLORS= to get rid of that nonsense.  It's quite unreadable like this.

Yep, that's why it took me a while to figure out what was going on. ;-)  It was
objecting to "-O3 -mvsx" looking like one argument because of an extra set of
parens in my dg-options directive.

>
>> Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
>> cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
>> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
>> s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
>> xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
>> compilation terminated.^M
>> compiler exited with status 1
> This is a test to see if there is offload stuff.  There isn't.

Which should never be run for RUNTESTFLAGS="powerpc.exp=vec-srad.c" ... something
is really hosed up.

>
>> FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
>> Excess errors:
>> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
>> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'
>>
>> Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
>> Do we still have a bug open on this?
> I have never seen this before, not in bugzilla, not in my own testing (I do
> however almost always use --disable-libgomp, it breaks way too often,
> impeding all other progress).

I pointed to the bugzilla in another reply -- which was "resolved" with a hack.
I consider it still broken this way...

I tested a revised version of the patch overnight and will submit shortly.

Bill

>
>>> Does it need -O3, does -O2 not work?
>> -O2 is fine, can change to that.
> Please do.
>
>>> Should this testcase check expected machine code as well?
>> I think we should defer that to the fix for PR89213.  We aren't at
>> ideal code quite yet.
> Gotcha.
>
>
> Segher
>
Segher Boessenkool Feb. 11, 2019, 2:11 p.m. UTC | #6
On Mon, Feb 11, 2019 at 07:17:16AM -0600, Bill Schmidt wrote:
> At -O0 (if I hand-inline everything myself to avoid errors), we scalarize
> the modulo/masking operation into a rldicl for each doubleword.  I really
> don't see any reason to change the code.

So what does this look like at expand (at -O0)?  Is it something that
is done at gimple level, is it expand itself, is it some target thing?

> >> For -mcpu=power9, we get close, but have some bad register allocation and
> >> an unnecessary extend:
> >>
> >>         xxspltib 0,4   <- why not just xxspltib 32,4?
> >>         xxlor 32,0,0   <- wasted copy
> > Yeah, huh.  Where does that come from...  I blame splitters after reload.
> 
> This only happens at -O2 and up, FWIW.  At -O1 we allocate the registers
> reasonably.

Heh.

> >> Weird.  I just tried adding -mvsx
> > Does it _need_ VSX anyway?  Are these builtins defined without it, too?
> 
> Yes (vector long long / V2DImode requires VSX).

So something like

/* { dg-do run } */
/* { dg-require-effective-target vsx_hw } */
/* { dg-options "-mvsx -O2" } */

then?

> I pointed to the bugzilla in another reply -- which was "resolved" with a hack.
> I consider it still broken this way...

Reopen that PR?

> I tested a revised version of the patch overnight and will submit shortly.

Thanks.


Segher
Bill Schmidt Feb. 11, 2019, 2:35 p.m. UTC | #7
On 2/11/19 8:11 AM, Segher Boessenkool wrote:
> On Mon, Feb 11, 2019 at 07:17:16AM -0600, Bill Schmidt wrote:
>> At -O0 (if I hand-inline everything myself to avoid errors), we scalarize
>> the modulo/masking operation into a rldicl for each doubleword.  I really
>> don't see any reason to change the code.
> So what does this look like at expand (at -O0)?  Is it something that
> is done at gimple level, is it expand itself, is it some target thing?

It's already a mask at expand, even at -O0.  vregs dump excerpt:

(insn 13 12 14 2 (set (reg:DI 129 [ _16 ])
        (vec_select:DI (reg:V2DI 126 [ _9 ])
            (parallel [
                    (const_int 0 [0])
                ]))) "vec-srad-modulo.c":40:10 1223 {vsx_extract_v2di}
     (nil))
(insn 14 13 15 2 (set (reg:DI 130 [ _17 ])
        (and:DI (reg:DI 129 [ _16 ])
            (const_int 63 [0x3f]))) "vec-srad-modulo.c":40:10 195 {anddi3_mask}
     (nil))
(insn 15 14 16 2 (set (reg:DI 131 [ _18 ])
        (vec_select:DI (reg:V2DI 126 [ _9 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) "vec-srad-modulo.c":40:10 1223 {vsx_extract_v2di}
     (nil))
(insn 16 15 17 2 (set (reg:DI 132 [ _19 ])
        (and:DI (reg:DI 131 [ _18 ])
            (const_int 63 [0x3f]))) "vec-srad-modulo.c":40:10 195 {anddi3_mask}
     (nil))

>
>>>> For -mcpu=power9, we get close, but have some bad register allocation and
>>>> an unnecessary extend:
>>>>
>>>>         xxspltib 0,4   <- why not just xxspltib 32,4?
>>>>         xxlor 32,0,0   <- wasted copy
>>> Yeah, huh.  Where does that come from...  I blame splitters after reload.
>> This only happens at -O2 and up, FWIW.  At -O1 we allocate the registers
>> reasonably.
> Heh.
>
>>>> Weird.  I just tried adding -mvsx
>>> Does it _need_ VSX anyway?  Are these builtins defined without it, too?
>> Yes (vector long long / V2DImode requires VSX).
> So something like
>
> /* { dg-do run } */
> /* { dg-require-effective-target vsx_hw } */
> /* { dg-options "-mvsx -O2" } */
>
> then?

What I have now:

/* { dg-do run { target { vsx_hw } } } */                                      
/* { dg-options "-O2 -mvsx" } */                                               

>
>> I pointed to the bugzilla in another reply -- which was "resolved" with a hack.
>> I consider it still broken this way...
> Reopen that PR?

Already done. ;-)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88920

Bill

>
>> I tested a revised version of the patch overnight and will submit shortly.
> Thanks.
>
>
> Segher
>
diff mbox series

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 268707)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -15735,13 +15735,37 @@  rs6000_gimple_fold_builtin (gimple_stmt_iterator *
     case ALTIVEC_BUILTIN_VSRAH:
     case ALTIVEC_BUILTIN_VSRAW:
     case P8V_BUILTIN_VSRAD:
-      arg0 = gimple_call_arg (stmt, 0);
-      arg1 = gimple_call_arg (stmt, 1);
-      lhs = gimple_call_lhs (stmt);
-      g = gimple_build_assign (lhs, RSHIFT_EXPR, arg0, arg1);
-      gimple_set_location (g, gimple_location (stmt));
-      gsi_replace (gsi, g, true);
-      return true;
+      {
+	arg0 = gimple_call_arg (stmt, 0);
+	arg1 = gimple_call_arg (stmt, 1);
+	lhs = gimple_call_lhs (stmt);
+	tree arg1_type = TREE_TYPE (arg1);
+	tree unsigned_arg1_type = unsigned_type_for (TREE_TYPE (arg1));
+	tree unsigned_element_type = unsigned_type_for (TREE_TYPE (arg1_type));
+	location_t loc = gimple_location (stmt);
+	/* Force arg1 into the range valid matching the arg0 type.  */
+	/* Build a vector consisting of the max valid bit-size values.  */
+	int n_elts = VECTOR_CST_NELTS (arg1);
+	tree element_size = build_int_cst (unsigned_element_type,
+					   128 / n_elts);
+	tree_vector_builder elts (unsigned_arg1_type, n_elts, 1);
+	for (int i = 0; i < n_elts; i++)
+	  elts.safe_push (element_size);
+	tree modulo_tree = elts.build ();
+	/* Modulo the provided shift value against that vector.  */
+	gimple_seq stmts = NULL;
+	tree unsigned_arg1 = gimple_build (&stmts, VIEW_CONVERT_EXPR,
+					   unsigned_arg1_type, arg1);
+	tree new_arg1 = gimple_build (&stmts, loc, TRUNC_MOD_EXPR,
+				      unsigned_arg1_type, unsigned_arg1,
+				      modulo_tree);
+	gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
+	/* And finally, do the shift.  */
+	g = gimple_build_assign (lhs, RSHIFT_EXPR, arg0, new_arg1);
+	gimple_set_location (g, loc);
+	gsi_replace (gsi, g, true);
+	return true;
+      }
    /* Flavors of vector shift left.
       builtin_altivec_vsl{b,h,w} -> vsl{b,h,w}.  */
     case ALTIVEC_BUILTIN_VSLB:
@@ -15795,14 +15819,34 @@  rs6000_gimple_fold_builtin (gimple_stmt_iterator *
 	arg0 = gimple_call_arg (stmt, 0);
 	arg1 = gimple_call_arg (stmt, 1);
 	lhs = gimple_call_lhs (stmt);
+	tree arg1_type = TREE_TYPE (arg1);
+	tree unsigned_arg1_type = unsigned_type_for (TREE_TYPE (arg1));
+	tree unsigned_element_type = unsigned_type_for (TREE_TYPE (arg1_type));
+	location_t loc = gimple_location (stmt);
 	gimple_seq stmts = NULL;
 	/* Convert arg0 to unsigned.  */
 	tree arg0_unsigned
 	  = gimple_build (&stmts, VIEW_CONVERT_EXPR,
 			  unsigned_type_for (TREE_TYPE (arg0)), arg0);
+	/* Force arg1 into the range valid matching the arg0 type.  */
+	/* Build a vector consisting of the max valid bit-size values.  */
+	int n_elts = VECTOR_CST_NELTS (arg1);
+	tree element_size = build_int_cst (unsigned_element_type,
+					   128 / n_elts);
+	tree_vector_builder elts (unsigned_arg1_type, n_elts, 1);
+	for (int i = 0; i < n_elts; i++)
+	  elts.safe_push (element_size);
+	tree modulo_tree = elts.build ();
+	/* Modulo the provided shift value against that vector.  */
+	tree unsigned_arg1 = gimple_build (&stmts, VIEW_CONVERT_EXPR,
+					   unsigned_arg1_type, arg1);
+	tree new_arg1 = gimple_build (&stmts, loc, TRUNC_MOD_EXPR,
+				      unsigned_arg1_type, unsigned_arg1,
+				      modulo_tree);
+	/* Do the shift.  */
 	tree res
 	  = gimple_build (&stmts, RSHIFT_EXPR,
-			  TREE_TYPE (arg0_unsigned), arg0_unsigned, arg1);
+			  TREE_TYPE (arg0_unsigned), arg0_unsigned, new_arg1);
 	/* Convert result back to the lhs type.  */
 	res = gimple_build (&stmts, VIEW_CONVERT_EXPR, TREE_TYPE (lhs), res);
 	gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
@@ -16061,7 +16105,7 @@  rs6000_gimple_fold_builtin (gimple_stmt_iterator *
     case ALTIVEC_BUILTIN_VSPLTISH:
     case ALTIVEC_BUILTIN_VSPLTISW:
       {
-	int size;
+	unsigned int size;
 	if (fn_code == ALTIVEC_BUILTIN_VSPLTISB)
 	  size = 8;
 	else if (fn_code == ALTIVEC_BUILTIN_VSPLTISH)
@@ -16072,6 +16116,13 @@  rs6000_gimple_fold_builtin (gimple_stmt_iterator *
 	arg0 = gimple_call_arg (stmt, 0);
 	lhs = gimple_call_lhs (stmt);
 
+	/* It's sometimes useful to use one of these to build a
+	   splat for V2DImode, since the upper bits will be ignored.
+	   Don't fold if we detect that situation, as we end up
+	   losing the splat instruction in this case.  */
+	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
+	  return false;
+
 	/* Only fold the vec_splat_*() if the lower bits of arg 0 is a
 	   5-bit signed constant in range -16 to +15.  */
 	if (TREE_CODE (arg0) != INTEGER_CST
Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
@@ -0,0 +1,43 @@ 
+/* Test that using a character splat to set up a shift-right algebraic
+   for a doubleword vector works correctly after gimple folding.  */
+
+/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
+/* { dg-options { "-O3" } } */
+
+#include <altivec.h>
+
+typedef __vector unsigned long long vui64_t;
+typedef __vector long long vi64_t;
+
+static inline vi64_t
+vec_sradi (vi64_t vra, const unsigned int shb)
+{
+  vui64_t rshift;
+  vi64_t result;
+
+  /* Note legitimate use of wrong-type splat due to expectation that only
+     lower 6-bits are read.  */
+  rshift = (vui64_t) vec_splat_s8((const unsigned char)shb);
+
+  /* Vector Shift Right Algebraic Doublewords based on the lower 6-bits
+     of corresponding element of rshift.  */
+  result = vec_vsrad (vra, rshift);
+
+  return (vi64_t) result;
+}
+
+vi64_t
+test_sradi_4 (vi64_t a)
+{
+  return vec_sradi (a, 4);
+}
+
+int
+main ()
+{
+  vi64_t x = {-256, 1025};
+  x = test_sradi_4 (x);
+  if (x[0] != -16 || x[1] != 64)
+    __builtin_abort ();
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/srd-modulo.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)
@@ -0,0 +1,42 @@ 
+/* Test that using a character splat to set up a shift-right logical
+   for a doubleword vector works correctly after gimple folding.  */
+
+/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
+/* { dg-options { "-O3" } } */
+
+#include <altivec.h>
+
+typedef __vector unsigned long long vui64_t;
+
+static inline vui64_t
+vec_sradi (vui64_t vra, const unsigned int shb)
+{
+  vui64_t rshift;
+  vui64_t result;
+
+  /* Note legitimate use of wrong-type splat due to expectation that only
+     lower 6-bits are read.  */
+  rshift = (vui64_t) vec_splat_s8((const unsigned char)shb);
+
+  /* Vector Shift Right [Logical] Doublewords based on the lower 6-bits
+     of corresponding element of rshift.  */
+  result = vec_vsrd (vra, rshift);
+
+  return (vui64_t) result;
+}
+
+vui64_t
+test_sradi_4 (vui64_t a)
+{
+  return vec_sradi (a, 4);
+}
+
+int
+main ()
+{
+  vui64_t x = {1992357, 1025};
+  x = test_sradi_4 (x);
+  if (x[0] != 124522 || x[1] != 64)
+    __builtin_abort ();
+  return 0;
+}