Message ID | D3DDVMZWT931.3KMYVL9G36NRD@gmail.com |
---|---|
State | New |
Headers | show |
Series | [1/8] docs: Document maskload else operand and behavior. | expand |
On 8/11/24 3:00 PM, Robin Dapp wrote: > This patch adds a zero else operand to the masked loads. > > gcc/ChangeLog: > > * config/gcn/predicates.md (maskload_else_operand): New > predicate. > * config/gcn/gcn-valu.md: Use new predicate. OK if the GCN maintainers don't chime in by the time the generic bits are ACK'd. jeff
(Sorry, I missed this because I was on vacation.) On 11/08/2024 22:00, Robin Dapp wrote: > This patch adds a zero else operand to the masked loads. The patch is OK, but I have a question below. > gcc/ChangeLog: > > * config/gcn/predicates.md (maskload_else_operand): New > predicate. > * config/gcn/gcn-valu.md: Use new predicate. > --- > gcc/config/gcn/gcn-valu.md | 6 ++++-- > gcc/config/gcn/predicates.md | 3 +++ > 2 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md > index b24cf9be32e..2344bc00ffc 100644 > --- a/gcc/config/gcn/gcn-valu.md > +++ b/gcc/config/gcn/gcn-valu.md > @@ -4002,7 +4002,8 @@ (define_expand "while_ultsidi" > (define_expand "maskload<mode>di" > [(match_operand:V_MOV 0 "register_operand") > (match_operand:V_MOV 1 "memory_operand") > - (match_operand 2 "")] > + (match_operand 2 "") > + (match_operand:V_MOV 3 "maskload_else_operand")] > "" > { > rtx exec = force_reg (DImode, operands[2]); > @@ -4040,7 +4041,8 @@ (define_expand "mask_gather_load<mode><vnsi>" > (match_operand:<VnSI> 2 "register_operand") > (match_operand 3 "immediate_operand") > (match_operand:SI 4 "gcn_alu_operand") > - (match_operand:DI 5 "")] > + (match_operand:DI 5 "") > + (match_operand:V_MOV 6 "maskload_else_operand")] > "" > { > rtx exec = force_reg (DImode, operands[5]); > diff --git a/gcc/config/gcn/predicates.md b/gcc/config/gcn/predicates.md > index 3f59396a649..9bc806cf990 100644 > --- a/gcc/config/gcn/predicates.md > +++ b/gcc/config/gcn/predicates.md > @@ -228,3 +228,6 @@ (define_predicate "ascending_zero_int_parallel" > return gcn_stepped_zero_int_parallel_p (op, 1); > }) > > +(define_predicate "maskload_else_operand" > + (and (match_code "const_int,const_vector") > + (match_test "op == CONST0_RTX (GET_MODE (op))"))) This forces maskload and mask_gather_load to only accept zero here, but in fact the hardware would allow us to accept any value (including undefined). Here's the expand code: /* Masked lanes are required to hold zero. */ emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0)); emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v, operands[0], exec)); In other words, initialize the whole vector to zero, and then use the gather_load instruction to implement the masked load (GCN does not have a contiguous-memory vector load instruction). We could easily omit the initialization instruction, or pass through the new value. Would there be any advantage to accepting other values, or is forcing zero actually the right choice? Thanks for the patch Andrew
> > +(define_predicate "maskload_else_operand" > > + (and (match_code "const_int,const_vector") > > + (match_test "op == CONST0_RTX (GET_MODE (op))"))) > > This forces maskload and mask_gather_load to only accept zero here, but > in fact the hardware would allow us to accept any value (including > undefined). > > Here's the expand code: > > /* Masked lanes are required to hold zero. */ > emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0)); > > emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v, > operands[0], exec)); Where does this requirement come from? Did you face problems when not setting the masked lanes to zero? I also realized the original problem with a gather load, but, as Richi noticed in the meanwhile, only happens with padding types. Generally the idea for the patch is to not require unconditional zeroing everything but also enable undefined (like riscv) and leave the decision to the vectorizer. So it would only zero if needed.
On Thu, 5 Sept 2024, 21:10 Robin Dapp, <rdapp.gcc@gmail.com> wrote: > > > +(define_predicate "maskload_else_operand" > > > + (and (match_code "const_int,const_vector") > > > + (match_test "op == CONST0_RTX (GET_MODE (op))"))) > > > > This forces maskload and mask_gather_load to only accept zero here, but > > in fact the hardware would allow us to accept any value (including > > undefined). > > > > Here's the expand code: > > > > /* Masked lanes are required to hold zero. */ > > emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0)); > > > > emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v, > > operands[0], exec)); > > Where does this requirement come from? Did you face problems when not > setting the masked lanes to zero? I also realized the original problem > with a gather load, but, as Richi noticed in the meanwhile, only happens > with padding types. > There were absolutely problems without this. It's a while ago now, so I'm struggling with the details, but as GCC only applies the mask to selected operations there were all sorts of issues that crept in. Zeroing the undefined lanes seemed to match the middle end assumptions (or, at least it made the UB consistent?) or maybe I read it in the code somewhere. Sorry, it's years since I wrote that. Generally the idea for the patch is to not require unconditional zeroing > everything but also enable undefined (like riscv) and leave the decision > to the vectorizer. So it would only zero if needed. This sounds like a generally good plan. Better than just zero it and hope that's right anyway. ;) So, in theory, is it better if amdgcn allows both? Or is that one little move immediate instruction in the backend going to produce better/cleaner middle end code? Andrew
On Thu, 05 Sep 2024 14:57:06 PDT (-0700), ams@baylibre.com wrote: > On Thu, 5 Sept 2024, 21:10 Robin Dapp, <rdapp.gcc@gmail.com> wrote: > >> > > +(define_predicate "maskload_else_operand" >> > > + (and (match_code "const_int,const_vector") >> > > + (match_test "op == CONST0_RTX (GET_MODE (op))"))) >> > >> > This forces maskload and mask_gather_load to only accept zero here, but >> > in fact the hardware would allow us to accept any value (including >> > undefined). >> > >> > Here's the expand code: >> > >> > /* Masked lanes are required to hold zero. */ >> > emit_move_insn (operands[0], gcn_vec_constant (<MODE>mode, 0)); >> > >> > emit_insn (gen_gather<mode>_expr_exec (operands[0], addr, as, v, >> > operands[0], exec)); >> >> Where does this requirement come from? Did you face problems when not >> setting the masked lanes to zero? I also realized the original problem >> with a gather load, but, as Richi noticed in the meanwhile, only happens >> with padding types. >> > > There were absolutely problems without this. It's a while ago now, so I'm > struggling with the details, but as GCC only applies the mask to selected > operations there were all sorts of issues that crept in. Zeroing the > undefined lanes seemed to match the middle end assumptions (or, at least it > made the UB consistent?) or maybe I read it in the code somewhere. Sorry, > it's years since I wrote that. Sorry to just randomly chime in here, but I bet Robin's asleep and I think you guys might just be talking past each other (and for some reason this one's poking my phone, so I kept seeing the messages). [Also something's a bit broken with the thread in general, it looks like some bits didn't make inbox.sourceware.org. They're all in my gmail, so hopefully they're possible to dig up on your end. Here's the best top-level I could find: https://inbox.sourceware.org/gcc-patches/D3DDV8VM4AR5.USIBT8RV5G5Z@gmail.com/] We ran into this unwritten "masked elements are zeroed" requirement in RISC-V as well, which resulted in a handful of bugs. The initial proposed workaround was to jam a bunch of zeroing into the backend, but that's both a bit inefficient and appears to have not fully worked due to some quirks of how the RISC-V vector extension configuration works <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115336>. So Robin now has this patch set to adujst the middle end to be explicit about what values it assumes for the unmasked elements, which we're then taking advantage of on RISC-V by not inserting the pre-zeroing. > Generally the idea for the patch is to not require unconditional zeroing >> everything but also enable undefined (like riscv) and leave the decision >> to the vectorizer. So it would only zero if needed. > > > This sounds like a generally good plan. Better than just zero it and hope > that's right anyway. ;) > > So, in theory, is it better if amdgcn allows both? Or is that one little > move immediate instruction in the backend going to produce better/cleaner > middle end code? IIUC the best code would come frome backends matching what the HW does and relying on the middle end to insert explicit ops to set the unmasked values where necessary. Unless the hardware has some faster way to insert the zeros on masked elements (which IIUC is how x86 and aarch64 work). I think Robin was just trying to adjust the backends to have no functional changes while still respecting the new on-mask element pattern flavors. Robin would know better, though. > > Andrew
> There were absolutely problems without this. It's a while ago now, so I'm > struggling with the details, but as GCC only applies the mask to selected > operations there were all sorts of issues that crept in. Zeroing the > undefined lanes seemed to match the middle end assumptions (or, at least it > made the UB consistent?) or maybe I read it in the code somewhere. Sorry, > it's years since I wrote that. So we only found two instances of this problem and both were related to _Bools. In case you have more cases, it would be greatly appreciated to verify the series with them. If you don't mind, would it be possible to comment out the zeroing, re-run the testsuite and check for FAILs? > This sounds like a generally good plan. Better than just zero it and hope > that's right anyway. ;) > > So, in theory, is it better if amdgcn allows both? Or is that one little > move immediate instruction in the backend going to produce better/cleaner > middle end code? The new predicate is supposed to inform the vectorizer of what it "prefers", i.e. the hardware does anyway. So if amdgcn leaves the inactive elements undefined the predicate should only accept undefined as well. Once the vectorizer requires zeros (or something else than undefined), it will, explicity, emit a zeroing merge/blend in gimple. That way the zeroing can easily be combined with surrounding code. Of course amdgcn could also advertise zero and then always force a zero before loading as you currently do. That would be unconditional, though, and the combination with surrounding RTL might also be a bit more difficult than when it's exposed in gimple already. Thanks!
On 06/09/2024 08:06, Robin Dapp wrote: >> There were absolutely problems without this. It's a while ago now, so I'm >> struggling with the details, but as GCC only applies the mask to selected >> operations there were all sorts of issues that crept in. Zeroing the >> undefined lanes seemed to match the middle end assumptions (or, at least it >> made the UB consistent?) or maybe I read it in the code somewhere. Sorry, >> it's years since I wrote that. > > So we only found two instances of this problem and both were related to > _Bools. In case you have more cases, it would be greatly appreciated > to verify the series with them. If you don't mind, would it be possible > to comment out the zeroing, re-run the testsuite and check for FAILs? I looked it up, and it was an execution failure in testcase gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization. I believe I observed other cases of this too, but I can't find a list. It shouldn't be too hard to run the test you suggest, but I won't have the results today. >> This sounds like a generally good plan. Better than just zero it and hope >> that's right anyway. ;) >> >> So, in theory, is it better if amdgcn allows both? Or is that one little >> move immediate instruction in the backend going to produce better/cleaner >> middle end code? > > The new predicate is supposed to inform the vectorizer of what it "prefers", > i.e. the hardware does anyway. So if amdgcn leaves the inactive elements > undefined the predicate should only accept undefined as well. > Once the vectorizer requires zeros (or something else than undefined), > it will, explicity, emit a zeroing merge/blend in gimple. That way the > zeroing can easily be combined with surrounding code. > > Of course amdgcn could also advertise zero and then always force a zero before > loading as you currently do. That would be unconditional, though, and the > combination with surrounding RTL might also be a bit more difficult than when > it's exposed in gimple already. OK, good to know, thanks! Andrew
> > So we only found two instances of this problem and both were related to > > _Bools. In case you have more cases, it would be greatly appreciated > > to verify the series with them. If you don't mind, would it be possible > > to comment out the zeroing, re-run the testsuite and check for FAILs? > > I looked it up, and it was an execution failure in testcase > gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization. Ah, I saw that one as well here. Thanks, will have a look locally.
On 06/09/2024 09:47, Robin Dapp wrote: >>> So we only found two instances of this problem and both were related to >>> _Bools. In case you have more cases, it would be greatly appreciated >>> to verify the series with them. If you don't mind, would it be possible >>> to comment out the zeroing, re-run the testsuite and check for FAILs? >> >> I looked it up, and it was an execution failure in testcase >> gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization. > > Ah, I saw that one as well here. Thanks, will have a look locally. > I ran the tests with the initialization removed, but I'm not too sure what to make of the results. There are 3 regressions and 8 progressions, but there's no consistency across the different devices (I tested gfx1100, gfx908, and gfx90a). I'm going to rerun the tests and see if it does the same again, or if there's some randomness. Andrew
On 10/09/2024 10:43, Andrew Stubbs wrote: > On 06/09/2024 09:47, Robin Dapp wrote: >>>> So we only found two instances of this problem and both were related to >>>> _Bools. In case you have more cases, it would be greatly appreciated >>>> to verify the series with them. If you don't mind, would it be >>>> possible >>>> to comment out the zeroing, re-run the testsuite and check for FAILs? >>> >>> I looked it up, and it was an execution failure in testcase >>> gfortran.dg/assumed_rank_1.f90 that prompted me to add the >>> initialization. >> >> Ah, I saw that one as well here. Thanks, will have a look locally. >> > > I ran the tests with the initialization removed, but I'm not too sure > what to make of the results. There are 3 regressions and 8 progressions, > but there's no consistency across the different devices (I tested > gfx1100, gfx908, and gfx90a). > > I'm going to rerun the tests and see if it does the same again, or if > there's some randomness. There's definitely some random chance at play here: two identical test runs resulted in different results. The problem test cases: gfortran.dg/all_bounds_1.f90 gfortran.dg/allocated_4.f90 gfortran.dg/team_change_1.f90 gfortran.dg/class_optional_1.f90 gcc.dg/torture/pr92152.c Hope that helps Andrew
On Wed, 11 Sep 2024, Andrew Stubbs wrote: > On 10/09/2024 10:43, Andrew Stubbs wrote: > > On 06/09/2024 09:47, Robin Dapp wrote: > >>>> So we only found two instances of this problem and both were related to > >>>> _Bools. In case you have more cases, it would be greatly appreciated > >>>> to verify the series with them. If you don't mind, would it be possible > >>>> to comment out the zeroing, re-run the testsuite and check for FAILs? > >>> > >>> I looked it up, and it was an execution failure in testcase > >>> gfortran.dg/assumed_rank_1.f90 that prompted me to add the initialization. > >> > >> Ah, I saw that one as well here. Thanks, will have a look locally. > >> > > > > I ran the tests with the initialization removed, but I'm not too sure what > > to make of the results. There are 3 regressions and 8 progressions, but > > there's no consistency across the different devices (I tested gfx1100, > > gfx908, and gfx90a). > > > > I'm going to rerun the tests and see if it does the same again, or if > > there's some randomness. > > There's definitely some random chance at play here: two identical test runs > resulted in different results. I guess that's what you expect when you have UD elements that in the end should have a defined value. Richard. > The problem test cases: > gfortran.dg/all_bounds_1.f90 > gfortran.dg/allocated_4.f90 > gfortran.dg/team_change_1.f90 > gfortran.dg/class_optional_1.f90 > gcc.dg/torture/pr92152.c > > Hope that helps > > Andrew > >
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index b24cf9be32e..2344bc00ffc 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -4002,7 +4002,8 @@ (define_expand "while_ultsidi" (define_expand "maskload<mode>di" [(match_operand:V_MOV 0 "register_operand") (match_operand:V_MOV 1 "memory_operand") - (match_operand 2 "")] + (match_operand 2 "") + (match_operand:V_MOV 3 "maskload_else_operand")] "" { rtx exec = force_reg (DImode, operands[2]); @@ -4040,7 +4041,8 @@ (define_expand "mask_gather_load<mode><vnsi>" (match_operand:<VnSI> 2 "register_operand") (match_operand 3 "immediate_operand") (match_operand:SI 4 "gcn_alu_operand") - (match_operand:DI 5 "")] + (match_operand:DI 5 "") + (match_operand:V_MOV 6 "maskload_else_operand")] "" { rtx exec = force_reg (DImode, operands[5]); diff --git a/gcc/config/gcn/predicates.md b/gcc/config/gcn/predicates.md index 3f59396a649..9bc806cf990 100644 --- a/gcc/config/gcn/predicates.md +++ b/gcc/config/gcn/predicates.md @@ -228,3 +228,6 @@ (define_predicate "ascending_zero_int_parallel" return gcn_stepped_zero_int_parallel_p (op, 1); }) +(define_predicate "maskload_else_operand" + (and (match_code "const_int,const_vector") + (match_test "op == CONST0_RTX (GET_MODE (op))")))