diff mbox series

[6/8] gcn: Add else operand to masked loads.

Message ID D3DDVMZWT931.3KMYVL9G36NRD@gmail.com
State New
Headers show
Series [1/8] docs: Document maskload else operand and behavior. | expand

Commit Message

Robin Dapp Aug. 11, 2024, 9 p.m. UTC
This patch adds a zero else operand to the masked loads.

gcc/ChangeLog:

	* config/gcn/predicates.md (maskload_else_operand): New
	predicate.
	* config/gcn/gcn-valu.md: Use new predicate.
---
 gcc/config/gcn/gcn-valu.md   | 6 ++++--
 gcc/config/gcn/predicates.md | 3 +++
 2 files changed, 7 insertions(+), 2 deletions(-)

Comments

Jeff Law Aug. 12, 2024, 6 p.m. UTC | #1
On 8/11/24 3:00 PM, Robin Dapp wrote:
> This patch adds a zero else operand to the masked loads.
> 
> gcc/ChangeLog:
> 
> 	* config/gcn/predicates.md (maskload_else_operand): New
> 	predicate.
> 	* config/gcn/gcn-valu.md: Use new predicate.
OK if the GCN maintainers don't chime in by the time the generic bits 
are ACK'd.

jeff
Andrew Stubbs Sept. 5, 2024, 5:01 p.m. UTC | #2
(Sorry, I missed this because I was on vacation.)

On 11/08/2024 22:00, Robin Dapp wrote:
> This patch adds a zero else operand to the masked loads.

The patch is OK, but I have a question below.

> gcc/ChangeLog:
> 
> 	* config/gcn/predicates.md (maskload_else_operand): New
> 	predicate.
> 	* config/gcn/gcn-valu.md: Use new predicate.
> ---
>   gcc/config/gcn/gcn-valu.md   | 6 ++++--
>   gcc/config/gcn/predicates.md | 3 +++
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
> index b24cf9be32e..2344bc00ffc 100644
> --- a/gcc/config/gcn/gcn-valu.md
> +++ b/gcc/config/gcn/gcn-valu.md
> @@ -4002,7 +4002,8 @@ (define_expand "while_ultsidi"
>   (define_expand "maskload<mode>di"
>     [(match_operand:V_MOV 0 "register_operand")
>      (match_operand:V_MOV 1 "memory_operand")
> -   (match_operand 2 "")]
> +   (match_operand 2 "")
> +   (match_operand:V_MOV 3 "maskload_else_operand")]
>     ""
>     {
>       rtx exec = force_reg (DImode, operands[2]);
> @@ -4040,7 +4041,8 @@ (define_expand "mask_gather_load<mode><vnsi>"
>      (match_operand:<VnSI> 2 "register_operand")
>      (match_operand 3 "immediate_operand")
>      (match_operand:SI 4 "gcn_alu_operand")
> -   (match_operand:DI 5 "")]
> +   (match_operand:DI 5 "")
> +   (match_operand:V_MOV 6 "maskload_else_operand")]
>     ""
>     {
>       rtx exec = force_reg (DImode, operands[5]);
> diff --git a/gcc/config/gcn/predicates.md b/gcc/config/gcn/predicates.md
> index 3f59396a649..9bc806cf990 100644
> --- a/gcc/config/gcn/predicates.md
> +++ b/gcc/config/gcn/predicates.md
> @@ -228,3 +228,6 @@ (define_predicate "ascending_zero_int_parallel"
>     return gcn_stepped_zero_int_parallel_p (op, 1);
>   })
>   
> +(define_predicate "maskload_else_operand"
> +  (and (match_code "const_int,const_vector")
> +       (match_test "op == CONST0_RTX (GET_MODE (op))")))

This forces maskload and mask_gather_load to only accept zero here, but 
in fact the hardware would allow us to accept any value (including 
undefined).

Here's the expand code:

   /* Masked lanes are required to hold zero.  */
   emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0));

   emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v,
                                          operands[0], exec));

In other words, initialize the whole vector to zero, and then use the 
gather_load instruction to implement the masked load (GCN does not have 
a contiguous-memory vector load instruction).

We could easily omit the initialization instruction, or pass through the 
new value.

Would there be any advantage to accepting other values, or is forcing 
zero actually the right choice?

Thanks for the patch

Andrew
Robin Dapp Sept. 5, 2024, 8:10 p.m. UTC | #3
> > +(define_predicate "maskload_else_operand"
> > +  (and (match_code "const_int,const_vector")
> > +       (match_test "op == CONST0_RTX (GET_MODE (op))")))
>
> This forces maskload and mask_gather_load to only accept zero here, but 
> in fact the hardware would allow us to accept any value (including 
> undefined).
>
> Here's the expand code:
>
>    /* Masked lanes are required to hold zero.  */
>    emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0));
>
>    emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v,
>                                           operands[0], exec));

Where does this requirement come from?  Did you face problems when not
setting the masked lanes to zero?  I also realized the original problem
with a gather load, but, as Richi noticed in the meanwhile, only happens
with padding types.

Generally the idea for the patch is to not require unconditional zeroing
everything but also enable undefined (like riscv) and leave the decision
to the vectorizer.  So it would only zero if needed.
Andrew Stubbs Sept. 5, 2024, 9:57 p.m. UTC | #4
On Thu, 5 Sept 2024, 21:10 Robin Dapp, <rdapp.gcc@gmail.com> wrote:

> > > +(define_predicate "maskload_else_operand"
> > > +  (and (match_code "const_int,const_vector")
> > > +       (match_test "op == CONST0_RTX (GET_MODE (op))")))
> >
> > This forces maskload and mask_gather_load to only accept zero here, but
> > in fact the hardware would allow us to accept any value (including
> > undefined).
> >
> > Here's the expand code:
> >
> >    /* Masked lanes are required to hold zero.  */
> >    emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0));
> >
> >    emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v,
> >                                           operands[0], exec));
>
> Where does this requirement come from?  Did you face problems when not
> setting the masked lanes to zero?  I also realized the original problem
> with a gather load, but, as Richi noticed in the meanwhile, only happens
> with padding types.
>

There were absolutely problems without this. It's a while ago now, so I'm
struggling with the details, but as GCC only applies the mask to selected
operations there were all sorts of issues that crept in. Zeroing the
undefined lanes seemed to match the middle end assumptions (or, at least it
made the UB consistent?) or maybe I read it in the code somewhere. Sorry,
it's years since I wrote that.

Generally the idea for the patch is to not require unconditional zeroing
> everything but also enable undefined (like riscv) and leave the decision
> to the vectorizer.  So it would only zero if needed.


This sounds like a generally good plan. Better than just zero it and hope
that's right anyway. ;)

So, in theory, is it better if amdgcn allows both? Or is that one little
move immediate instruction in the backend going to produce better/cleaner
middle end code?

Andrew
Palmer Dabbelt Sept. 5, 2024, 11:59 p.m. UTC | #5
On Thu, 05 Sep 2024 14:57:06 PDT (-0700), ams@baylibre.com wrote:
> On Thu, 5 Sept 2024, 21:10 Robin Dapp, <rdapp.gcc@gmail.com> wrote:
>
>> > > +(define_predicate "maskload_else_operand"
>> > > +  (and (match_code "const_int,const_vector")
>> > > +       (match_test "op == CONST0_RTX (GET_MODE (op))")))
>> >
>> > This forces maskload and mask_gather_load to only accept zero here, but
>> > in fact the hardware would allow us to accept any value (including
>> > undefined).
>> >
>> > Here's the expand code:
>> >
>> >    /* Masked lanes are required to hold zero.  */
>> >    emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0));
>> >
>> >    emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v,
>> >                                           operands[0], exec));
>>
>> Where does this requirement come from?  Did you face problems when not
>> setting the masked lanes to zero?  I also realized the original problem
>> with a gather load, but, as Richi noticed in the meanwhile, only happens
>> with padding types.
>>
>
> There were absolutely problems without this. It's a while ago now, so I'm
> struggling with the details, but as GCC only applies the mask to selected
> operations there were all sorts of issues that crept in. Zeroing the
> undefined lanes seemed to match the middle end assumptions (or, at least it
> made the UB consistent?) or maybe I read it in the code somewhere. Sorry,
> it's years since I wrote that.

Sorry to just randomly chime in here, but I bet Robin's asleep and I 
think you guys might just be talking past each other (and for some 
reason this one's poking my phone, so I kept seeing the messages).

[Also something's a bit broken with the thread in general, it looks like 
some bits didn't make inbox.sourceware.org.  They're all in my gmail, so 
hopefully they're possible to dig up on your end.  Here's the best 
top-level I could find: 
https://inbox.sourceware.org/gcc-patches/D3DDV8VM4AR5.USIBT8RV5G5Z@gmail.com/]

We ran into this unwritten "masked elements are zeroed" requirement in 
RISC-V as well, which resulted in a handful of bugs.  The initial 
proposed workaround was to jam a bunch of zeroing into the backend, but 
that's both a bit inefficient and appears to have not fully worked due 
to some quirks of how the RISC-V vector extension configuration works 
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115336>.

So Robin now has this patch set to adujst the middle end to be explicit 
about what values it assumes for the unmasked elements, which we're then 
taking advantage of on RISC-V by not inserting the pre-zeroing.

> Generally the idea for the patch is to not require unconditional zeroing
>> everything but also enable undefined (like riscv) and leave the decision
>> to the vectorizer.  So it would only zero if needed.
>
>
> This sounds like a generally good plan. Better than just zero it and hope
> that's right anyway. ;)
>
> So, in theory, is it better if amdgcn allows both? Or is that one little
> move immediate instruction in the backend going to produce better/cleaner
> middle end code?

IIUC the best code would come frome backends matching what the HW does 
and relying on the middle end to insert explicit ops to set the unmasked 
values where necessary.  Unless the hardware has some faster way to 
insert the zeros on masked elements (which IIUC is how x86 and aarch64 
work).

I think Robin was just trying to adjust the backends to have no 
functional changes while still respecting the new on-mask element 
pattern flavors.

Robin would know better, though.

>
> Andrew
Robin Dapp Sept. 6, 2024, 7:06 a.m. UTC | #6
> There were absolutely problems without this. It's a while ago now, so I'm
> struggling with the details, but as GCC only applies the mask to selected
> operations there were all sorts of issues that crept in. Zeroing the
> undefined lanes seemed to match the middle end assumptions (or, at least it
> made the UB consistent?) or maybe I read it in the code somewhere. Sorry,
> it's years since I wrote that.

So we only found two instances of this problem and both were related to
_Bools.  In case you have more cases, it would be greatly appreciated
to verify the series with them.  If you don't mind, would it be possible
to comment out the zeroing, re-run the testsuite and check for FAILs?

> This sounds like a generally good plan. Better than just zero it and hope
> that's right anyway. ;)
>
> So, in theory, is it better if amdgcn allows both? Or is that one little
> move immediate instruction in the backend going to produce better/cleaner
> middle end code?

The new predicate is supposed to inform the vectorizer of what it "prefers",
i.e. the hardware does anyway.  So if amdgcn leaves the inactive elements
undefined the predicate should only accept undefined as well.
Once the vectorizer requires zeros (or something else than undefined),
it will, explicity, emit a zeroing merge/blend in gimple.  That way the
zeroing can easily be combined with surrounding code.

Of course amdgcn could also advertise zero and then always force a zero before
loading as you currently do.  That would be unconditional, though, and the
combination with surrounding RTL might also be a bit more difficult than when
it's exposed in gimple already.

Thanks!
Andrew Stubbs Sept. 6, 2024, 8:37 a.m. UTC | #7
On 06/09/2024 08:06, Robin Dapp wrote:
>> There were absolutely problems without this. It's a while ago now, so I'm
>> struggling with the details, but as GCC only applies the mask to selected
>> operations there were all sorts of issues that crept in. Zeroing the
>> undefined lanes seemed to match the middle end assumptions (or, at least it
>> made the UB consistent?) or maybe I read it in the code somewhere. Sorry,
>> it's years since I wrote that.
> 
> So we only found two instances of this problem and both were related to
> _Bools.  In case you have more cases, it would be greatly appreciated
> to verify the series with them.  If you don't mind, would it be possible
> to comment out the zeroing, re-run the testsuite and check for FAILs?

I looked it up, and it was an execution failure in testcase 
gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization.

I believe I observed other cases of this too, but I can't find a list.

It shouldn't be too hard to run the test you suggest, but I won't have 
the results today.

>> This sounds like a generally good plan. Better than just zero it and hope
>> that's right anyway. ;)
>>
>> So, in theory, is it better if amdgcn allows both? Or is that one little
>> move immediate instruction in the backend going to produce better/cleaner
>> middle end code?
> 
> The new predicate is supposed to inform the vectorizer of what it "prefers",
> i.e. the hardware does anyway.  So if amdgcn leaves the inactive elements
> undefined the predicate should only accept undefined as well.
> Once the vectorizer requires zeros (or something else than undefined),
> it will, explicity, emit a zeroing merge/blend in gimple.  That way the
> zeroing can easily be combined with surrounding code.
> 
> Of course amdgcn could also advertise zero and then always force a zero before
> loading as you currently do.  That would be unconditional, though, and the
> combination with surrounding RTL might also be a bit more difficult than when
> it's exposed in gimple already.

OK, good to know, thanks!

Andrew
Robin Dapp Sept. 6, 2024, 8:47 a.m. UTC | #8
> > So we only found two instances of this problem and both were related to
> > _Bools.  In case you have more cases, it would be greatly appreciated
> > to verify the series with them.  If you don't mind, would it be possible
> > to comment out the zeroing, re-run the testsuite and check for FAILs?
>
> I looked it up, and it was an execution failure in testcase 
> gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization.

Ah, I saw that one as well here.  Thanks, will have a look locally.
Andrew Stubbs Sept. 10, 2024, 9:43 a.m. UTC | #9
On 06/09/2024 09:47, Robin Dapp wrote:
>>> So we only found two instances of this problem and both were related to
>>> _Bools.  In case you have more cases, it would be greatly appreciated
>>> to verify the series with them.  If you don't mind, would it be possible
>>> to comment out the zeroing, re-run the testsuite and check for FAILs?
>>
>> I looked it up, and it was an execution failure in testcase
>> gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization.
> 
> Ah, I saw that one as well here.  Thanks, will have a look locally.
> 

I ran the tests with the initialization removed, but I'm not too sure 
what to make of the results. There are 3 regressions and 8 progressions, 
but there's no consistency across the different devices (I tested 
gfx1100, gfx908, and gfx90a).

I'm going to rerun the tests and see if it does the same again, or if 
there's some randomness.

Andrew
Andrew Stubbs Sept. 11, 2024, 1:06 p.m. UTC | #10
On 10/09/2024 10:43, Andrew Stubbs wrote:
> On 06/09/2024 09:47, Robin Dapp wrote:
>>>> So we only found two instances of this problem and both were related to
>>>> _Bools.  In case you have more cases, it would be greatly appreciated
>>>> to verify the series with them.  If you don't mind, would it be 
>>>> possible
>>>> to comment out the zeroing, re-run the testsuite and check for FAILs?
>>>
>>> I looked it up, and it was an execution failure in testcase
>>> gfortran.dg/assumed_rank_1.f90 that prompted me to add the 
>>> initialization.
>>
>> Ah, I saw that one as well here.  Thanks, will have a look locally.
>>
> 
> I ran the tests with the initialization removed, but I'm not too sure 
> what to make of the results. There are 3 regressions and 8 progressions, 
> but there's no consistency across the different devices (I tested 
> gfx1100, gfx908, and gfx90a).
> 
> I'm going to rerun the tests and see if it does the same again, or if 
> there's some randomness.

There's definitely some random chance at play here: two identical test 
runs resulted in different results.

The problem test cases:
    gfortran.dg/all_bounds_1.f90
    gfortran.dg/allocated_4.f90
    gfortran.dg/team_change_1.f90
    gfortran.dg/class_optional_1.f90
    gcc.dg/torture/pr92152.c

Hope that helps

Andrew
Richard Biener Sept. 11, 2024, 1:21 p.m. UTC | #11
On Wed, 11 Sep 2024, Andrew Stubbs wrote:

> On 10/09/2024 10:43, Andrew Stubbs wrote:
> > On 06/09/2024 09:47, Robin Dapp wrote:
> >>>> So we only found two instances of this problem and both were related to
> >>>> _Bools.  In case you have more cases, it would be greatly appreciated
> >>>> to verify the series with them.  If you don't mind, would it be possible
> >>>> to comment out the zeroing, re-run the testsuite and check for FAILs?
> >>>
> >>> I looked it up, and it was an execution failure in testcase
> >>> gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization.
> >>
> >> Ah, I saw that one as well here.  Thanks, will have a look locally.
> >>
> > 
> > I ran the tests with the initialization removed, but I'm not too sure what
> > to make of the results. There are 3 regressions and 8 progressions, but
> > there's no consistency across the different devices (I tested gfx1100,
> > gfx908, and gfx90a).
> > 
> > I'm going to rerun the tests and see if it does the same again, or if
> > there's some randomness.
> 
> There's definitely some random chance at play here: two identical test runs
> resulted in different results.

I guess that's what you expect when you have UD elements that in
the end should have a defined value.

Richard.

> The problem test cases:
>    gfortran.dg/all_bounds_1.f90
>    gfortran.dg/allocated_4.f90
>    gfortran.dg/team_change_1.f90
>    gfortran.dg/class_optional_1.f90
>    gcc.dg/torture/pr92152.c
> 
> Hope that helps
> 
> Andrew
> 
>
diff mbox series

Patch

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index b24cf9be32e..2344bc00ffc 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -4002,7 +4002,8 @@  (define_expand "while_ultsidi"
 (define_expand "maskload<mode>di"
   [(match_operand:V_MOV 0 "register_operand")
    (match_operand:V_MOV 1 "memory_operand")
-   (match_operand 2 "")]
+   (match_operand 2 "")
+   (match_operand:V_MOV 3 "maskload_else_operand")]
   ""
   {
     rtx exec = force_reg (DImode, operands[2]);
@@ -4040,7 +4041,8 @@  (define_expand "mask_gather_load<mode><vnsi>"
    (match_operand:<VnSI> 2 "register_operand")
    (match_operand 3 "immediate_operand")
    (match_operand:SI 4 "gcn_alu_operand")
-   (match_operand:DI 5 "")]
+   (match_operand:DI 5 "")
+   (match_operand:V_MOV 6 "maskload_else_operand")]
   ""
   {
     rtx exec = force_reg (DImode, operands[5]);
diff --git a/gcc/config/gcn/predicates.md b/gcc/config/gcn/predicates.md
index 3f59396a649..9bc806cf990 100644
--- a/gcc/config/gcn/predicates.md
+++ b/gcc/config/gcn/predicates.md
@@ -228,3 +228,6 @@  (define_predicate "ascending_zero_int_parallel"
   return gcn_stepped_zero_int_parallel_p (op, 1);
 })
 
+(define_predicate "maskload_else_operand"
+  (and (match_code "const_int,const_vector")
+       (match_test "op == CONST0_RTX (GET_MODE (op))")))