rs6000: Vector shift-right should honor modulo semantics

Message ID	f1f5ac18-6698-303d-fa58-95e77453a4c2@linux.ibm.com
State	New
Headers	show Return-Path: <gcc-patches-return-495738-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:to:cc :from:subject:date:mime-version:message-id:content-type :content-transfer-encoding; q=dns; s=default; b=lEhH/iGggIYlp1z3 84ojIGORZem1NyiGxX3kcJkeMA/7+/qODcpoA9Hii9FCblvBKxnNsODYxS6EDelr Cg7Bd/EWVjoqvlJ+J+Fyuew5XGDsSTRRDiEPDR3qESC4aTec/b+Pxg2zt+GfWQzf 9zt3D/y1bCbX4lvAShShZ0axr+4= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Gateway: Authorized Use Only! Violators will be prosecuted for <gcc-patches@gcc.gnu.org> from <wschmidt@linux.ibm.com>; Sun, 10 Feb 2019 16:10:08 -0000 Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Sun, 10 Feb 2019 16:10:05 -0000 To: GCC Patches <gcc-patches@gcc.gnu.org> Cc: Segher Boessenkool <segher@kernel.crashing.org> From: Bill Schmidt <wschmidt@linux.ibm.com> Subject: [PATCH] rs6000: Vector shift-right should honor modulo semantics Date: Sun, 10 Feb 2019 10:10:02 -0600 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 MIME-Version: 1.0 Message-Id: <f1f5ac18-6698-303d-fa58-95e77453a4c2@linux.ibm.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit
Series	rs6000: Vector shift-right should honor modulo semantics \| expand rs6000: Vector shift-right should honor modulo semantics

Bill Schmidt Feb. 10, 2019, 4:10 p.m. UTC

Hi!

We had a problem report for code attempting to implement a vector right-shift for a
vector long long (V2DImode) type.  The programmer noted that we don't have a vector
splat-immediate for this mode, but cleverly realized he could use a vector char
splat-immediate since only the lower 6 bits of the shift count are read.  This is a
documented feature of both the vector shift built-ins and the underlying instruction.

Starting with GCC 8, the vector shifts are expanded early in rs6000_gimple_fold_builtin.
However, the GIMPLE folding does not currently perform the required TRUNC_MOD_EXPR to
implement the built-in semantics.  It appears that this was caught earlier for vector
shift-left and fixed, but the same problem was not fixed for vector shift-right.
This patch fixes that.

While fixing that problem, I noted that we get inferior code generation when we try
to fold the vector char splat earlier, due to the type mismatch and some additional
optimizations performed in the middle end.  Because this is a rare circumstance, it
makes sense to avoid the GIMPLE folding in this case and allow the back end to do
the expansion; this produces the clean code we're expecting with the vspltisb intact.

I've added executable tests for both shift-right algebraic and shift-right logical.
Both fail prior to applying the patch, and work correctly afterwards.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.  Is
this okay for trunk, and for GCC 8.3 if there is no fallout by the end of the
week?

Thanks,
Bill


[gcc]

2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>

	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
	for correct semantics.  Also, don't expand a vector-splat if there
	is a type mismatch; let the back end handle it.

[gcc/testsuite]

2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>

	* gcc.target/powerpc/srad-modulo.c: New.
	* gcc.target/powerpc/srd-modulo.c: New.

Segher Boessenkool Feb. 10, 2019, 10:05 p.m. UTC | #1

Hi Bill,

On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
> I've added executable tests for both shift-right algebraic and shift-right logical.
> Both fail prior to applying the patch, and work correctly afterwards.

Please add a test for left shifts, as well?

> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
> 	for correct semantics.  Also, don't expand a vector-splat if there
> 	is a type mismatch; let the back end handle it.

Does it always result in just the shift instruction now?  Does the modulo
ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
modulo by a power of two, so a simple bitmask, so maybe write that directly
instead?

> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* gcc.target/powerpc/srad-modulo.c: New.
> 	* gcc.target/powerpc/srd-modulo.c: New.

Please put "vec-" in the testcase name.  You may need to rename vec-shift.c
and/or vec-shr.c, which are not as generic as their names suggest.

> @@ -16072,6 +16116,13 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *
>  	arg0 = gimple_call_arg (stmt, 0);
>  	lhs = gimple_call_lhs (stmt);
>  
> +	/* It's sometimes useful to use one of these to build a
> +	   splat for V2DImode, since the upper bits will be ignored.
> +	   Don't fold if we detect that situation, as we end up
> +	   losing the splat instruction in this case.  */
> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
> +	  return false;

This isn't really detecting that situation...  Instead, it doesn't fold
whenever the size of the splatted elements isn't the same as the size of
the elts of the target vector.  That's probably perfectly safe, but please
spell it out.  It's fine to mention the motivating case, of course.

> Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
> @@ -0,0 +1,43 @@
> +/* Test that using a character splat to set up a shift-right algebraic
> +   for a doubleword vector works correctly after gimple folding.  */
> +
> +/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
> +/* { dg-options { "-O3" } } */

powerpc64*-*-* is almost always wrong.  I don't think you need to limit
to 64-bit at all here, but if you do, test for lp64 instead.

Testing for vsx_hw but not enabling vsx is probably wrong, too.

Does it need -O3, does -O2 not work?

Should this testcase check expected machine code as well?

> --- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)

> +vui64_t
> +test_sradi_4 (vui64_t a)
> +{
> +  return vec_sradi (a, 4);
> +}

Pasto?  (srad vs. srd).

Segher

Bill Schmidt Feb. 11, 2019, 2:42 a.m. UTC | #2

On 2/10/19 4:05 PM, Segher Boessenkool wrote:
> Hi Bill,
>
> On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
>> I've added executable tests for both shift-right algebraic and shift-right logical.
>> Both fail prior to applying the patch, and work correctly afterwards.
> Please add a test for left shifts, as well?

Can do.  I verified that left shifts were not broken and figured a test case
had been added then, but have not checked.  Good to test this particular
scenario, though.

>
>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>
>> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
>> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
>> 	for correct semantics.  Also, don't expand a vector-splat if there
>> 	is a type mismatch; let the back end handle it.
> Does it always result in just the shift instruction now?  Does the modulo
> ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
> modulo by a power of two, so a simple bitmask, so maybe write that directly
> instead?

We always get the shift.  For -mcpu=power8, we always load the mask from
memory rather than generating the vspltisb, which is not ideal code generation,
but is at least correct.

For -mcpu=power9, we get close, but have some bad register allocation and
an unnecessary extend:

        xxspltib 0,4   <- why not just xxspltib 32,4?
        xxlor 32,0,0   <- wasted copy
        vextsb2d 0,0   <- unnecessary due to vsrad semantics
        vsrad 2,2,0

Again, this is at least correct.  We have more work to do to produce the
most efficient code, but we have PR89213 open for that.

>
>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>
>> 	* gcc.target/powerpc/srad-modulo.c: New.
>> 	* gcc.target/powerpc/srd-modulo.c: New.
> Please put "vec-" in the testcase name.  You may need to rename vec-shift.c
> and/or vec-shr.c, which are not as generic as their names suggest.

Ok.

>
>> @@ -16072,6 +16116,13 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *
>>  	arg0 = gimple_call_arg (stmt, 0);
>>  	lhs = gimple_call_lhs (stmt);
>>  
>> +	/* It's sometimes useful to use one of these to build a
>> +	   splat for V2DImode, since the upper bits will be ignored.
>> +	   Don't fold if we detect that situation, as we end up
>> +	   losing the splat instruction in this case.  */
>> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
>> +	  return false;
> This isn't really detecting that situation...  Instead, it doesn't fold
> whenever the size of the splatted elements isn't the same as the size of
> the elts of the target vector.  That's probably perfectly safe, but please
> spell it out.  It's fine to mention the motivating case, of course.

Yep, will correct.  Actually, as I look back at my notes, I believe that this
change is not necessary after all (same code generated with and without it).
I'll verify.

>
>> Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
>> ===================================================================
>> --- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
>> +++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
>> @@ -0,0 +1,43 @@
>> +/* Test that using a character splat to set up a shift-right algebraic
>> +   for a doubleword vector works correctly after gimple folding.  */
>> +
>> +/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
>> +/* { dg-options { "-O3" } } */
> powerpc64*-*-* is almost always wrong.  I don't think you need to limit
> to 64-bit at all here, but if you do, test for lp64 instead.

Ok.

>
> Testing for vsx_hw but not enabling vsx is probably wrong, too.

Weird.  I just tried adding -mvsx and I get this peculiar error we've seen
before about AMD graphics card offloading:

spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
./srad-modulo.exe^M
^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
compiler exited with status 1
Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
compilation terminated.^M
compiler exited with status 1
FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
Excess errors:
^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
m^[[K' or '^[[01m^[[Kfast^[[m^[[K'

Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
Do we still have a bug open on this?

>
> Does it need -O3, does -O2 not work?

-O2 is fine, can change to that.

>
> Should this testcase check expected machine code as well?

I think we should defer that to the fix for PR89213.  We aren't at
ideal code quite yet.

>
>> --- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
>> +++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)
>> +vui64_t
>> +test_sradi_4 (vui64_t a)
>> +{
>> +  return vec_sradi (a, 4);
>> +}
> Pasto?  (srad vs. srd).

Ecch.  Will fix.  

>
>
> Segher
>

Bill Schmidt Feb. 11, 2019, 2:56 a.m. UTC | #3

On 2/10/19 8:42 PM, Bill Schmidt wrote:
> On 2/10/19 4:05 PM, Segher Boessenkool wrote:
>> Hi Bill,
>>
>> On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
>>> I've added executable tests for both shift-right algebraic and shift-right logical.
>>> Both fail prior to applying the patch, and work correctly afterwards.
>> Please add a test for left shifts, as well?
> Can do.  I verified that left shifts were not broken and figured a test case
> had been added then, but have not checked.  Good to test this particular
> scenario, though.
>
>>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>>
>>> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
>>> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
>>> 	for correct semantics.  Also, don't expand a vector-splat if there
>>> 	is a type mismatch; let the back end handle it.
>> Does it always result in just the shift instruction now?  Does the modulo
>> ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
>> modulo by a power of two, so a simple bitmask, so maybe write that directly
>> instead?
> We always get the shift.  For -mcpu=power8, we always load the mask from
> memory rather than generating the vspltisb, which is not ideal code generation,
> but is at least correct.
>
> For -mcpu=power9, we get close, but have some bad register allocation and
> an unnecessary extend:
>
>         xxspltib 0,4   <- why not just xxspltib 32,4?
>         xxlor 32,0,0   <- wasted copy
>         vextsb2d 0,0   <- unnecessary due to vsrad semantics
>         vsrad 2,2,0
>
> Again, this is at least correct.  We have more work to do to produce the
> most efficient code, but we have PR89213 open for that.
>
>>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>>
>>> 	* gcc.target/powerpc/srad-modulo.c: New.
>>> 	* gcc.target/powerpc/srd-modulo.c: New.
>> Please put "vec-" in the testcase name.  You may need to rename vec-shift.c
>> and/or vec-shr.c, which are not as generic as their names suggest.
> Ok.
>
>>> @@ -16072,6 +16116,13 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *
>>>  	arg0 = gimple_call_arg (stmt, 0);
>>>  	lhs = gimple_call_lhs (stmt);
>>>  
>>> +	/* It's sometimes useful to use one of these to build a
>>> +	   splat for V2DImode, since the upper bits will be ignored.
>>> +	   Don't fold if we detect that situation, as we end up
>>> +	   losing the splat instruction in this case.  */
>>> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
>>> +	  return false;
>> This isn't really detecting that situation...  Instead, it doesn't fold
>> whenever the size of the splatted elements isn't the same as the size of
>> the elts of the target vector.  That's probably perfectly safe, but please
>> spell it out.  It's fine to mention the motivating case, of course.
> Yep, will correct.  Actually, as I look back at my notes, I believe that this
> change is not necessary after all (same code generated with and without it).
> I'll verify.
>
>>> Index: gcc/testsuite/gcc.target/powerpc/srad-modulo.c
>>> ===================================================================
>>> --- gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(nonexistent)
>>> +++ gcc/testsuite/gcc.target/powerpc/srad-modulo.c	(working copy)
>>> @@ -0,0 +1,43 @@
>>> +/* Test that using a character splat to set up a shift-right algebraic
>>> +   for a doubleword vector works correctly after gimple folding.  */
>>> +
>>> +/* { dg-do run { target { powerpc64*-*-* && vsx_hw } } } */
>>> +/* { dg-options { "-O3" } } */
>> powerpc64*-*-* is almost always wrong.  I don't think you need to limit
>> to 64-bit at all here, but if you do, test for lp64 instead.
> Ok.
>
>> Testing for vsx_hw but not enabling vsx is probably wrong, too.
> Weird.  I just tried adding -mvsx and I get this peculiar error we've seen
> before about AMD graphics card offloading:
>
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
> est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
> ./srad-modulo.exe^M
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
> compiler exited with status 1
> Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
> cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
> s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
> xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
> compilation terminated.^M
> compiler exited with status 1
> FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
> Excess errors:
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'
>
> Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
> Do we still have a bug open on this?

Looks like this was PR88920, and apparently this silly offload thing pops up 
once per testsuite run if there's a failure of any kind.  I don't consider just
caching it so it only happens once to be an adequate fix... this shouldn't be
happening at all and is highly misleading.

The actual problem is that I've got an extra set of braces in the dg-options
directive... oops...

Thanks,
Bill


>
>> Does it need -O3, does -O2 not work?
> -O2 is fine, can change to that.
>
>> Should this testcase check expected machine code as well?
> I think we should defer that to the fix for PR89213.  We aren't at
> ideal code quite yet.
>
>>> --- gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(nonexistent)
>>> +++ gcc/testsuite/gcc.target/powerpc/srd-modulo.c	(working copy)
>>> +vui64_t
>>> +test_sradi_4 (vui64_t a)
>>> +{
>>> +  return vec_sradi (a, 4);
>>> +}
>> Pasto?  (srad vs. srd).
> Ecch.  Will fix.  
>
>>
>> Segher
>>

Segher Boessenkool Feb. 11, 2019, 12:43 p.m. UTC | #4

On Sun, Feb 10, 2019 at 08:42:42PM -0600, Bill Schmidt wrote:
> On 2/10/19 4:05 PM, Segher Boessenkool wrote:
> > On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
> >> I've added executable tests for both shift-right algebraic and shift-right logical.
> >> Both fail prior to applying the patch, and work correctly afterwards.
> > Please add a test for left shifts, as well?
> 
> Can do.  I verified that left shifts were not broken and figured a test case
> had been added then, but have not checked.  Good to test this particular
> scenario, though.

Well you actually added code for left shifts, too ;-)

> >> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
> >>
> >> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
> >> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
> >> 	for correct semantics.  Also, don't expand a vector-splat if there
> >> 	is a type mismatch; let the back end handle it.
> > Does it always result in just the shift instruction now?  Does the modulo
> > ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
> > modulo by a power of two, so a simple bitmask, so maybe write that directly
> > instead?
> 
> We always get the shift.  For -mcpu=power8, we always load the mask from
> memory rather than generating the vspltisb, which is not ideal code generation,
> but is at least correct.

_Just_ the shift, I meant.  I know we load a constant from memory, for
constants; I am worried about non-constants, do those get a modulo, or
just a mask.  Also at -O0; if we get an actual division at -O0, there
probably are other cases where it isn't optimised away as well.

> For -mcpu=power9, we get close, but have some bad register allocation and
> an unnecessary extend:
> 
>         xxspltib 0,4   <- why not just xxspltib 32,4?
>         xxlor 32,0,0   <- wasted copy

Yeah, huh.  Where does that come from...  I blame splitters after reload.

>         vextsb2d 0,0   <- unnecessary due to vsrad semantics
>         vsrad 2,2,0
> 
> Again, this is at least correct.  We have more work to do to produce the
> most efficient code, but we have PR89213 open for that.

Yes.  But I'd like to see at least a mask instead of a modulo.

> >> +	/* It's sometimes useful to use one of these to build a
> >> +	   splat for V2DImode, since the upper bits will be ignored.
> >> +	   Don't fold if we detect that situation, as we end up
> >> +	   losing the splat instruction in this case.  */
> >> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
> >> +	  return false;
> > This isn't really detecting that situation...  Instead, it doesn't fold
> > whenever the size of the splatted elements isn't the same as the size of
> > the elts of the target vector.  That's probably perfectly safe, but please
> > spell it out.  It's fine to mention the motivating case, of course.
> 
> Yep, will correct.  Actually, as I look back at my notes, I believe that this
> change is not necessary after all (same code generated with and without it).
> I'll verify.

Ha, that would be nice :-)

> > Testing for vsx_hw but not enabling vsx is probably wrong, too.
> 
> Weird.  I just tried adding -mvsx

Does it _need_ VSX anyway?  Are these builtins defined without it, too?

> and I get this peculiar error we've seen
> before about AMD graphics card offloading:
> 
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
> est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
> ./srad-modulo.exe^M
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
> compiler exited with status 1

Huh, why do you get colour escapes at all?  I thought the testsuite used
GCC_COLORS= to get rid of that nonsense.  It's quite unreadable like this.

> Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
> cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
> s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
> xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
> compilation terminated.^M
> compiler exited with status 1

This is a test to see if there is offload stuff.  There isn't.

> FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
> Excess errors:
> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'
> 
> Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
> Do we still have a bug open on this?

I have never seen this before, not in bugzilla, not in my own testing (I do
however almost always use --disable-libgomp, it breaks way too often,
impeding all other progress).

> > Does it need -O3, does -O2 not work?
> 
> -O2 is fine, can change to that.

Please do.

> > Should this testcase check expected machine code as well?
> 
> I think we should defer that to the fix for PR89213.  We aren't at
> ideal code quite yet.

Gotcha.


Segher

Bill Schmidt Feb. 11, 2019, 1:17 p.m. UTC | #5

On 2/11/19 6:43 AM, Segher Boessenkool wrote:
> On Sun, Feb 10, 2019 at 08:42:42PM -0600, Bill Schmidt wrote:
>> On 2/10/19 4:05 PM, Segher Boessenkool wrote:
>>> On Sun, Feb 10, 2019 at 10:10:02AM -0600, Bill Schmidt wrote:
>>>> I've added executable tests for both shift-right algebraic and shift-right logical.
>>>> Both fail prior to applying the patch, and work correctly afterwards.
>>> Please add a test for left shifts, as well?
>> Can do.  I verified that left shifts were not broken and figured a test case
>> had been added then, but have not checked.  Good to test this particular
>> scenario, though.
> Well you actually added code for left shifts, too ;-)

Nope. :-)  It just looks like it because of the way the diff hunks come out.
The changes are to shift-right-logical and shift-right-algebraic.  But it
still doesn't hurt to add the test.

>
>>>> 2019-02-08  Bill Schmidt  <wschmidt@linux.ibm.com>
>>>>
>>>> 	* config/rs6000/rs6000.c (rs6000_gimple_fold_builtin): Shift-right
>>>> 	and shift-left vector built-ins need to include a TRUNC_MOD_EXPR
>>>> 	for correct semantics.  Also, don't expand a vector-splat if there
>>>> 	is a type mismatch; let the back end handle it.
>>> Does it always result in just the shift instruction now?  Does the modulo
>>> ever remain?  (Maybe at -O0?)  Modulo is hugely expensive; it is always
>>> modulo by a power of two, so a simple bitmask, so maybe write that directly
>>> instead?
>> We always get the shift.  For -mcpu=power8, we always load the mask from
>> memory rather than generating the vspltisb, which is not ideal code generation,
>> but is at least correct.
> _Just_ the shift, I meant.  I know we load a constant from memory, for
> constants; I am worried about non-constants, do those get a modulo, or
> just a mask.  Also at -O0; if we get an actual division at -O0, there
> probably are other cases where it isn't optimised away as well.

Even at -O1 this is all folded away in GIMPLE:

test_sradi_4 (vi64_t a)
{
  vi64_t result;

  <bb 2> [local count: 1073741824]:
  result_3 = a_2(D) >> 4;
  return result_3;

}

At -O0 (if I hand-inline everything myself to avoid errors), we scalarize
the modulo/masking operation into a rldicl for each doubleword.  I really
don't see any reason to change the code.

>
>> For -mcpu=power9, we get close, but have some bad register allocation and
>> an unnecessary extend:
>>
>>         xxspltib 0,4   <- why not just xxspltib 32,4?
>>         xxlor 32,0,0   <- wasted copy
> Yeah, huh.  Where does that come from...  I blame splitters after reload.

This only happens at -O2 and up, FWIW.  At -O1 we allocate the registers
reasonably.

>
>>         vextsb2d 0,0   <- unnecessary due to vsrad semantics
>>         vsrad 2,2,0
>>
>> Again, this is at least correct.  We have more work to do to produce the
>> most efficient code, but we have PR89213 open for that.
> Yes.  But I'd like to see at least a mask instead of a modulo.

Again, there is no modulo by the time we exit GIMPLE.  Our efficiency
failings are all in the back end.

>
>>>> +	/* It's sometimes useful to use one of these to build a
>>>> +	   splat for V2DImode, since the upper bits will be ignored.
>>>> +	   Don't fold if we detect that situation, as we end up
>>>> +	   losing the splat instruction in this case.  */
>>>> +	if (size != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (TREE_TYPE (lhs)))))
>>>> +	  return false;
>>> This isn't really detecting that situation...  Instead, it doesn't fold
>>> whenever the size of the splatted elements isn't the same as the size of
>>> the elts of the target vector.  That's probably perfectly safe, but please
>>> spell it out.  It's fine to mention the motivating case, of course.
>> Yep, will correct.  Actually, as I look back at my notes, I believe that this
>> change is not necessary after all (same code generated with and without it).
>> I'll verify.
> Ha, that would be nice :-)

Confirmed, BTW.

>
>>> Testing for vsx_hw but not enabling vsx is probably wrong, too.
>> Weird.  I just tried adding -mvsx
> Does it _need_ VSX anyway?  Are these builtins defined without it, too?

Yes (vector long long / V2DImode requires VSX).

>
>> and I get this peculiar error we've seen
>> before about AMD graphics card offloading:
>>
>> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ /home/wschmidt/gcc/gcc-mainline-t\
>> est2/gcc/testsuite/gcc.target/powerpc/srad-modulo.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mvsx -lm -o \
>> ./srad-modulo.exe^M
>> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
>> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'^M
>> compiler exited with status 1
> Huh, why do you get colour escapes at all?  I thought the testsuite used
> GCC_COLORS= to get rid of that nonsense.  It's quite unreadable like this.

Yep, that's why it took me a while to figure out what was going on. ;-)  It was
objecting to "-O3 -mvsx" looking like one argument because of an extra set of
parens in my dg-options directive.

>
>> Executing on host: /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c    -fno-diagnosti\
>> cs-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never  -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s    (timeout = 300)
>> spawn -ignore SIGHUP /home/wschmidt/gcc/build/gcc-mainline-test2/gcc/xgcc -B/home/wschmidt/gcc/build/gcc-mainline-test2/gcc/ offload_gcn7262.c -fno-diagnostic\
>> s-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -foffload=amdgcn-unknown-amdhsa -S -o offload_gcn7262.s^M
>> xgcc: fatal error: GCC is not configured to support amdgcn-unknown-amdhsa as offload target^M
>> compilation terminated.^M
>> compiler exited with status 1
> This is a test to see if there is offload stuff.  There isn't.

Which should never be run for RUNTESTFLAGS="powerpc.exp=vec-srad.c" ... something
is really hosed up.

>
>> FAIL: gcc.target/powerpc/srad-modulo.c (test for excess errors)
>> Excess errors:
>> ^[[01m^[[Kcc1:^[[m^[[K ^[[01;31m^[[Kerror: ^[[m^[[Kargument to '^[[01m^[[K-O^[[m^[[K' should be a non-negative integer, '^[[01m^[[Kg^[[m^[[K', '^[[01m^[[Ks^[[\
>> m^[[K' or '^[[01m^[[Kfast^[[m^[[K'
>>
>> Any ideas what's causing this?  I can't test with -mvsx until it's fixed.
>> Do we still have a bug open on this?
> I have never seen this before, not in bugzilla, not in my own testing (I do
> however almost always use --disable-libgomp, it breaks way too often,
> impeding all other progress).

I pointed to the bugzilla in another reply -- which was "resolved" with a hack.
I consider it still broken this way...

I tested a revised version of the patch overnight and will submit shortly.

Bill

>
>>> Does it need -O3, does -O2 not work?
>> -O2 is fine, can change to that.
> Please do.
>
>>> Should this testcase check expected machine code as well?
>> I think we should defer that to the fix for PR89213.  We aren't at
>> ideal code quite yet.
> Gotcha.
>
>
> Segher
>

Segher Boessenkool Feb. 11, 2019, 2:11 p.m. UTC | #6

On Mon, Feb 11, 2019 at 07:17:16AM -0600, Bill Schmidt wrote:
> At -O0 (if I hand-inline everything myself to avoid errors), we scalarize
> the modulo/masking operation into a rldicl for each doubleword.  I really
> don't see any reason to change the code.

So what does this look like at expand (at -O0)?  Is it something that
is done at gimple level, is it expand itself, is it some target thing?

> >> For -mcpu=power9, we get close, but have some bad register allocation and
> >> an unnecessary extend:
> >>
> >>         xxspltib 0,4   <- why not just xxspltib 32,4?
> >>         xxlor 32,0,0   <- wasted copy
> > Yeah, huh.  Where does that come from...  I blame splitters after reload.
> 
> This only happens at -O2 and up, FWIW.  At -O1 we allocate the registers
> reasonably.

Heh.

> >> Weird.  I just tried adding -mvsx
> > Does it _need_ VSX anyway?  Are these builtins defined without it, too?
> 
> Yes (vector long long / V2DImode requires VSX).

So something like

/* { dg-do run } */
/* { dg-require-effective-target vsx_hw } */
/* { dg-options "-mvsx -O2" } */

then?

> I pointed to the bugzilla in another reply -- which was "resolved" with a hack.
> I consider it still broken this way...

Reopen that PR?

> I tested a revised version of the patch overnight and will submit shortly.

Thanks.


Segher

Bill Schmidt Feb. 11, 2019, 2:35 p.m. UTC | #7

On 2/11/19 8:11 AM, Segher Boessenkool wrote:
> On Mon, Feb 11, 2019 at 07:17:16AM -0600, Bill Schmidt wrote:
>> At -O0 (if I hand-inline everything myself to avoid errors), we scalarize
>> the modulo/masking operation into a rldicl for each doubleword.  I really
>> don't see any reason to change the code.
> So what does this look like at expand (at -O0)?  Is it something that
> is done at gimple level, is it expand itself, is it some target thing?

It's already a mask at expand, even at -O0.  vregs dump excerpt:

(insn 13 12 14 2 (set (reg:DI 129 [ _16 ])
        (vec_select:DI (reg:V2DI 126 [ _9 ])
            (parallel [
                    (const_int 0 [0])
                ]))) "vec-srad-modulo.c":40:10 1223 {vsx_extract_v2di}
     (nil))
(insn 14 13 15 2 (set (reg:DI 130 [ _17 ])
        (and:DI (reg:DI 129 [ _16 ])
            (const_int 63 [0x3f]))) "vec-srad-modulo.c":40:10 195 {anddi3_mask}
     (nil))
(insn 15 14 16 2 (set (reg:DI 131 [ _18 ])
        (vec_select:DI (reg:V2DI 126 [ _9 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) "vec-srad-modulo.c":40:10 1223 {vsx_extract_v2di}
     (nil))
(insn 16 15 17 2 (set (reg:DI 132 [ _19 ])
        (and:DI (reg:DI 131 [ _18 ])
            (const_int 63 [0x3f]))) "vec-srad-modulo.c":40:10 195 {anddi3_mask}
     (nil))

>
>>>> For -mcpu=power9, we get close, but have some bad register allocation and
>>>> an unnecessary extend:
>>>>
>>>>         xxspltib 0,4   <- why not just xxspltib 32,4?
>>>>         xxlor 32,0,0   <- wasted copy
>>> Yeah, huh.  Where does that come from...  I blame splitters after reload.
>> This only happens at -O2 and up, FWIW.  At -O1 we allocate the registers
>> reasonably.
> Heh.
>
>>>> Weird.  I just tried adding -mvsx
>>> Does it _need_ VSX anyway?  Are these builtins defined without it, too?
>> Yes (vector long long / V2DImode requires VSX).
> So something like
>
> /* { dg-do run } */
> /* { dg-require-effective-target vsx_hw } */
> /* { dg-options "-mvsx -O2" } */
>
> then?

What I have now:

/* { dg-do run { target { vsx_hw } } } */                                      
/* { dg-options "-O2 -mvsx" } */                                               

>
>> I pointed to the bugzilla in another reply -- which was "resolved" with a hack.
>> I consider it still broken this way...
> Reopen that PR?

Already done. ;-)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88920

Bill

>
>> I tested a revised version of the patch overnight and will submit shortly.
> Thanks.
>
>
> Segher
>

rs6000: Vector shift-right should honor modulo semantics

Commit Message

Comments

Patch