Message ID | 30906c0d-3b9f-e1e6-156f-c01fcf229cb9@linux.ibm.com |
---|---|
Headers | show |
Series | Support vector load/store with length | expand |
On Tue, 26 May 2020, Kewen.Lin wrote: > Hi all, > > This patch set adds support for vector load/store with length, Power > ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with > length, it's good to be exploited for those cases we don't have enough > stuffs to fill in the whole vector like epilogues. > > This support mainly refers to the handlings for fully-predicated loop > but it also covers the epilogue usage. Now it supports two modes > controlled by parameter vect-with-length-scope, it can support any > loops fully with length or just for those cases with small iteration > counts less than VF like epilogue, for now I don't have ready env to > benchmark it, but based on the current inefficient length generation, > I don't think it's a good idea to adopt vector with length for any loops. > For the main loop which used to be vectorized, it increases register > pressure and introduces extra computation for length, the pro for icache > seems not comparable. But I think it might be a good idea to keep this > parameter there for functionality testing, further benchmarking and other > ports' potential future supports. Can you explain in more detail what "vector load/store with length" does? Is that a "simplified" masked load/store which instead of masking arbitrary elements (and need a mask computed in the first place), masks elements > N (the length operand)? Thus assuming a lane IV decrementing to zero that IV would be the natural argument for the length operand? If that's correct, what data are the remaining lanes filled with? From a look at the series description below you seem to add a new way of doing loads for this. Did you review other ISAs (those I'm not familiar with myself too much are SVE, RISC-V and GCN) in GCC whether they have similar support and whether your approach can be supported there? ISTR SVE must have some similar support - what's the reason you do not piggy-back on that? I think a load like I described above might be represented as _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3)); not sure if that actually works out though. But given it seems it is a contiguous load we shouldn't need an internal function here? [there's a possible size mismatch in the __VIEW_CONVERT above, I guess on RTL you end up with a paradoxical subreg - or an UNSPEC] That said, I'm not very happy seeing yet another way of doing loads [for fully predicated loops]. I'd rather like to see a single representation on GIMPLE at least. Will dig into the patch once the actual workings of those load/store with length is confirmed. I don't spot tree-vect-slp.c being changed - maybe that's not necessary for SLP operation, but please do not introduce new vectorizer features without supporting SLP operation at this point. Thanks, Richard. > As we don't have any benchmarking, this support isn't enabled by default > for any particular cpus, all testings are with explicit parameter setting. > > Bootstrapped on powerpc64le-linux-gnu P9 with all vect-with-length-scope > settings (0/1/2). Regress-test passed with vector-with-length-scope 0, > for the other twos, several vector related cases need to be updated, no > remarkable failures found. BTW, P9 is the one which supports the > functionality but not ready to evaluate the performance. > > Here still are many things to be supported or improved, not limited to: > - reduction/live-out support > - Cost model adding/tweaking > - IFN gimple folding > - Some unnecessary ops improvements eg: vector_size check > - Some possible refactoring > I'll support/post the patches gradually. > > Any comments are highly appreciated. > > BR, > Kewen > ----- > > Patch set outline: > [PATCH 1/7] ifn/optabs: Support vector load/store with length > [PATCH 2/7] rs6000: lenload/lenstore optab support > [PATCH 3/7] vect: Factor out codes for niters smaller than vf check > [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length > [PATCH 5/7] vect: Support vector load/store with length in vectorizer > [PATCH 6/7] ivopts: Add handlings for vector with length IFNs > [PATCH 7/7] rs6000/testsuite: Vector with length test cases > > gcc/config/rs6000/rs6000.c | 3 + > gcc/config/rs6000/vsx.md | 30 ++++++++++ > gcc/doc/invoke.texi | 7 +++ > gcc/doc/md.texi | 16 ++++++ > gcc/doc/tm.texi | 6 ++ > gcc/doc/tm.texi.in | 2 + > gcc/internal-fn.c | 13 ++++- > gcc/internal-fn.def | 6 ++ > gcc/optabs.def | 2 + > gcc/params.opt | 4 ++ > gcc/target.def | 7 +++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h | 18 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h | 17 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h | 31 +++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h | 24 ++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h | 29 ++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h | 32 +++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c | 15 +++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c | 15 +++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c | 18 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c | 15 +++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c | 15 +++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c | 16 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c | 16 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c | 16 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c | 17 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c | 16 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c | 16 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c | 16 ++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c | 10 ++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h | 34 ++++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h | 36 ++++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h | 34 ++++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h | 62 +++++++++++++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h | 45 +++++++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h | 52 +++++++++++++++++ > gcc/testsuite/gcc.target/powerpc/p9-vec-length.h | 14 +++++ > gcc/tree-ssa-loop-ivopts.c | 4 ++ > gcc/tree-vect-loop-manip.c | 268 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- > gcc/tree-vect-loop.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----- > gcc/tree-vect-stmts.c | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++ > gcc/tree-vectorizer.h | 32 +++++++++++ > 53 files changed, 1545 insertions(+), 18 deletions(-) >
Hi Richi, on 2020/5/26 下午3:12, Richard Biener wrote: > On Tue, 26 May 2020, Kewen.Lin wrote: > >> Hi all, >> >> This patch set adds support for vector load/store with length, Power >> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with >> length, it's good to be exploited for those cases we don't have enough >> stuffs to fill in the whole vector like epilogues. >> >> This support mainly refers to the handlings for fully-predicated loop >> but it also covers the epilogue usage. Now it supports two modes >> controlled by parameter vect-with-length-scope, it can support any >> loops fully with length or just for those cases with small iteration >> counts less than VF like epilogue, for now I don't have ready env to >> benchmark it, but based on the current inefficient length generation, >> I don't think it's a good idea to adopt vector with length for any loops. >> For the main loop which used to be vectorized, it increases register >> pressure and introduces extra computation for length, the pro for icache >> seems not comparable. But I think it might be a good idea to keep this >> parameter there for functionality testing, further benchmarking and other >> ports' potential future supports. > > Can you explain in more detail what "vector load/store with length" does? > Is that a "simplified" masked load/store which instead of masking > arbitrary elements (and need a mask computed in the first place), masks > elements > N (the length operand)? Thus assuming a lane IV decrementing > to zero that IV would be the natural argument for the length operand? > If that's correct, what data are the remaining lanes filled with? > The vector load/store have one GPR holding one length in bytes (called as n here) to control how many bytes we want to load/store. If n > vector_size (on Power it's 16), n will be taken as 16, if n is zero, the storage access won't happen, the result for load is vector zero, if n > 0 but < vector_size, the remaining lanes will be filled with zero. On Power, it checks 0:7 bits of the length GPR, so the length should be adjusted. Your understanding is correct! It's a "simplified" masked load/store, need the length in bytes computed, only support continuous access. For the lane IV, the one should multiply with the lane bytes and be adjusted if needed. > From a look at the series description below you seem to add a new way > of doing loads for this. Did you review other ISAs (those I'm not > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether > they have similar support and whether your approach can be supported > there? ISTR SVE must have some similar support - what's the reason > you do not piggy-back on that? I may miss something, but I didn't find there is any support that meet with this in GCC. Good suggestion on ISAs, I didn't review other ISAs, for the current implementation, I heavily referred to existing SVE fully predicated loop support, it's stronger than "with length", it can deal with arbitrary elements, only perform for the loop fully etc. > > I think a load like I described above might be represented as > > _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3)); > > not sure if that actually works out though. But given it seems it > is a contiguous load we shouldn't need an internal function here? > [there's a possible size mismatch in the __VIEW_CONVERT above, I guess > on RTL you end up with a paradoxical subreg - or an UNSPEC] IIUC, what you suggested is to avoid use IFN here. May I know the reason? is it due to the maintenance efforts on various IFNs? I thought using IFN is gentle. And agreed, I had the concern that the size mismatch problem here since the length can be large (extremely large probably, it depends on target saturated limit), the converted vector size can be small. Besides, the length can be a variable. > > That said, I'm not very happy seeing yet another way of doing loads > [for fully predicated loops]. I'd rather like to see a single > representation on GIMPLE at least. OK. Does it mean IFN isn't counted into this scope? :) > > Will dig into the patch once the actual workings of those load/store with > length is confirmed. Thanks a lot for your time! > > I don't spot tree-vect-slp.c being changed - maybe that's not necessary > for SLP operation, but please do not introduce new vectorizer features > without supporting SLP operation at this point. Got it. SLP is also one part to be supported, I forgot to state in the original email. I'll let it be for now. :) BR, Kewen
On Tue, 26 May 2020, Kewen.Lin wrote: > Hi Richi, > > on 2020/5/26 下午3:12, Richard Biener wrote: > > On Tue, 26 May 2020, Kewen.Lin wrote: > > > >> Hi all, > >> > >> This patch set adds support for vector load/store with length, Power > >> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with > >> length, it's good to be exploited for those cases we don't have enough > >> stuffs to fill in the whole vector like epilogues. > >> > >> This support mainly refers to the handlings for fully-predicated loop > >> but it also covers the epilogue usage. Now it supports two modes > >> controlled by parameter vect-with-length-scope, it can support any > >> loops fully with length or just for those cases with small iteration > >> counts less than VF like epilogue, for now I don't have ready env to > >> benchmark it, but based on the current inefficient length generation, > >> I don't think it's a good idea to adopt vector with length for any loops. > >> For the main loop which used to be vectorized, it increases register > >> pressure and introduces extra computation for length, the pro for icache > >> seems not comparable. But I think it might be a good idea to keep this > >> parameter there for functionality testing, further benchmarking and other > >> ports' potential future supports. > > > > Can you explain in more detail what "vector load/store with length" does? > > Is that a "simplified" masked load/store which instead of masking > > arbitrary elements (and need a mask computed in the first place), masks > > elements > N (the length operand)? Thus assuming a lane IV decrementing > > to zero that IV would be the natural argument for the length operand? > > If that's correct, what data are the remaining lanes filled with? > > > > The vector load/store have one GPR holding one length in bytes (called as > n here) to control how many bytes we want to load/store. If n > vector_size > (on Power it's 16), n will be taken as 16, if n is zero, the storage access > won't happen, the result for load is vector zero, if n > 0 but < vector_size, > the remaining lanes will be filled with zero. On Power, it checks 0:7 bits > of the length GPR, so the length should be adjusted. > > Your understanding is correct! It's a "simplified" masked load/store, need > the length in bytes computed, only support continuous access. For the lane > IV, the one should multiply with the lane bytes and be adjusted if needed. > > > From a look at the series description below you seem to add a new way > > of doing loads for this. Did you review other ISAs (those I'm not > > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether > > they have similar support and whether your approach can be supported > > there? ISTR SVE must have some similar support - what's the reason > > you do not piggy-back on that? > > I may miss something, but I didn't find there is any support that meet with > this in GCC. Good suggestion on ISAs, I didn't review other ISAs, for the > current implementation, I heavily referred to existing SVE fully predicated > loop support, it's stronger than "with length", it can deal with arbitrary > elements, only perform for the loop fully etc. > > > > > I think a load like I described above might be represented as > > > > _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3)); > > > > not sure if that actually works out though. But given it seems it > > is a contiguous load we shouldn't need an internal function here? > > [there's a possible size mismatch in the __VIEW_CONVERT above, I guess > > on RTL you end up with a paradoxical subreg - or an UNSPEC] > > IIUC, what you suggested is to avoid use IFN here. May I know the reason? > is it due to the maintenance efforts on various IFNs? I thought using > IFN is gentle. And agreed, I had the concern that the size mismatch > problem here since the length can be large (extremely large probably, it > depends on target saturated limit), the converted vector size can be small. > Besides, the length can be a variable. > > > > > That said, I'm not very happy seeing yet another way of doing loads > > [for fully predicated loops]. I'd rather like to see a single > > representation on GIMPLE at least. > > OK. Does it mean IFN isn't counted into this scope? :) Sure going with a new IFN is a possibility. I'd just avoid that if possible ;) How does SVE represent loads/stores for whilelt loops? Would it not be possible to share that representation somehow? Richard. > > > > Will dig into the patch once the actual workings of those load/store with > > length is confirmed. > > Thanks a lot for your time! > > > > > I don't spot tree-vect-slp.c being changed - maybe that's not necessary > > for SLP operation, but please do not introduce new vectorizer features > > without supporting SLP operation at this point. > > Got it. SLP is also one part to be supported, I forgot to state in the > original email. I'll let it be for now. :) > > BR, > Kewen >
on 2020/5/26 下午5:44, Richard Biener wrote: > On Tue, 26 May 2020, Kewen.Lin wrote: > >> Hi Richi, >> >> on 2020/5/26 下午3:12, Richard Biener wrote: >>> On Tue, 26 May 2020, Kewen.Lin wrote: >>> >>>> Hi all, >>>> >>>> This patch set adds support for vector load/store with length, Power >>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with >>>> length, it's good to be exploited for those cases we don't have enough >>>> stuffs to fill in the whole vector like epilogues. >>>> >>>> This support mainly refers to the handlings for fully-predicated loop >>>> but it also covers the epilogue usage. Now it supports two modes >>>> controlled by parameter vect-with-length-scope, it can support any >>>> loops fully with length or just for those cases with small iteration >>>> counts less than VF like epilogue, for now I don't have ready env to >>>> benchmark it, but based on the current inefficient length generation, >>>> I don't think it's a good idea to adopt vector with length for any loops. >>>> For the main loop which used to be vectorized, it increases register >>>> pressure and introduces extra computation for length, the pro for icache >>>> seems not comparable. But I think it might be a good idea to keep this >>>> parameter there for functionality testing, further benchmarking and other >>>> ports' potential future supports. >>> >>> Can you explain in more detail what "vector load/store with length" does? >>> Is that a "simplified" masked load/store which instead of masking >>> arbitrary elements (and need a mask computed in the first place), masks >>> elements > N (the length operand)? Thus assuming a lane IV decrementing >>> to zero that IV would be the natural argument for the length operand? >>> If that's correct, what data are the remaining lanes filled with? >>> >> >> The vector load/store have one GPR holding one length in bytes (called as >> n here) to control how many bytes we want to load/store. If n > vector_size >> (on Power it's 16), n will be taken as 16, if n is zero, the storage access >> won't happen, the result for load is vector zero, if n > 0 but < vector_size, >> the remaining lanes will be filled with zero. On Power, it checks 0:7 bits >> of the length GPR, so the length should be adjusted. >> >> Your understanding is correct! It's a "simplified" masked load/store, need >> the length in bytes computed, only support continuous access. For the lane >> IV, the one should multiply with the lane bytes and be adjusted if needed. >> >>> From a look at the series description below you seem to add a new way >>> of doing loads for this. Did you review other ISAs (those I'm not >>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether >>> they have similar support and whether your approach can be supported >>> there? ISTR SVE must have some similar support - what's the reason >>> you do not piggy-back on that? >> >> I may miss something, but I didn't find there is any support that meet with >> this in GCC. Good suggestion on ISAs, I didn't review other ISAs, for the >> current implementation, I heavily referred to existing SVE fully predicated >> loop support, it's stronger than "with length", it can deal with arbitrary >> elements, only perform for the loop fully etc. >> >>> >>> I think a load like I described above might be represented as >>> >>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3)); >>> >>> not sure if that actually works out though. But given it seems it >>> is a contiguous load we shouldn't need an internal function here? >>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess >>> on RTL you end up with a paradoxical subreg - or an UNSPEC] >> >> IIUC, what you suggested is to avoid use IFN here. May I know the reason? >> is it due to the maintenance efforts on various IFNs? I thought using >> IFN is gentle. And agreed, I had the concern that the size mismatch >> problem here since the length can be large (extremely large probably, it >> depends on target saturated limit), the converted vector size can be small. >> Besides, the length can be a variable. >> >>> >>> That said, I'm not very happy seeing yet another way of doing loads >>> [for fully predicated loops]. I'd rather like to see a single >>> representation on GIMPLE at least. >> >> OK. Does it mean IFN isn't counted into this scope? :) > > Sure going with a new IFN is a possibility. I'd just avoid that if > possible ;) How does SVE represent loads/stores for whilelt loops? > Would it not be possible to share that representation somehow? > > Richard. > Got it. :) Currently SVE uses IFNs .MASK_LOAD and .MASK_STORE for whileult loops: vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9); ... .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18); ... next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... }); if (next_mask_26 != { 0, ... }) I thought to share it once, but didn't feel it's a good idea since it's like we take something as masks that isn't actually masks, easily confuse people. OTOH, Power or other potential targets probably supports these kinds of masking instructions, easily to confuse people further. But I definitely agree that we can refactor some codes to share with each other when possible. Btw, here the code with proposed IFNs looks like: vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5); ... .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18); ivtmp_23 = ivtmp_22 + 16; _25 = MIN_EXPR <ivtmp_23, 508>; _26 = 508 - _25; _27 = MIN_EXPR <_26, 16>; if (ivtmp_23 <= 507) BR, Kewen >>> >>> Will dig into the patch once the actual workings of those load/store with >>> length is confirmed. >> >> Thanks a lot for your time! >> >>> >>> I don't spot tree-vect-slp.c being changed - maybe that's not necessary >>> for SLP operation, but please do not introduce new vectorizer features >>> without supporting SLP operation at this point. >> >> Got it. SLP is also one part to be supported, I forgot to state in the >> original email. I'll let it be for now. :) >> >> BR, >> Kewen >> >
"Kewen.Lin" <linkw@linux.ibm.com> writes: > on 2020/5/26 下午5:44, Richard Biener wrote: >> On Tue, 26 May 2020, Kewen.Lin wrote: >> >>> Hi Richi, >>> >>> on 2020/5/26 下午3:12, Richard Biener wrote: >>>> On Tue, 26 May 2020, Kewen.Lin wrote: >>>> >>>>> Hi all, >>>>> >>>>> This patch set adds support for vector load/store with length, Power >>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with >>>>> length, it's good to be exploited for those cases we don't have enough >>>>> stuffs to fill in the whole vector like epilogues. >>>>> >>>>> This support mainly refers to the handlings for fully-predicated loop >>>>> but it also covers the epilogue usage. Now it supports two modes >>>>> controlled by parameter vect-with-length-scope, it can support any >>>>> loops fully with length or just for those cases with small iteration >>>>> counts less than VF like epilogue, for now I don't have ready env to >>>>> benchmark it, but based on the current inefficient length generation, >>>>> I don't think it's a good idea to adopt vector with length for any loops. >>>>> For the main loop which used to be vectorized, it increases register >>>>> pressure and introduces extra computation for length, the pro for icache >>>>> seems not comparable. But I think it might be a good idea to keep this >>>>> parameter there for functionality testing, further benchmarking and other >>>>> ports' potential future supports. >>>> >>>> Can you explain in more detail what "vector load/store with length" does? >>>> Is that a "simplified" masked load/store which instead of masking >>>> arbitrary elements (and need a mask computed in the first place), masks >>>> elements > N (the length operand)? Thus assuming a lane IV decrementing >>>> to zero that IV would be the natural argument for the length operand? >>>> If that's correct, what data are the remaining lanes filled with? >>>> >>> >>> The vector load/store have one GPR holding one length in bytes (called as >>> n here) to control how many bytes we want to load/store. If n > vector_size >>> (on Power it's 16), n will be taken as 16, if n is zero, the storage access >>> won't happen, the result for load is vector zero, if n > 0 but < vector_size, >>> the remaining lanes will be filled with zero. On Power, it checks 0:7 bits >>> of the length GPR, so the length should be adjusted. >>> >>> Your understanding is correct! It's a "simplified" masked load/store, need >>> the length in bytes computed, only support continuous access. For the lane >>> IV, the one should multiply with the lane bytes and be adjusted if needed. >>> >>>> From a look at the series description below you seem to add a new way >>>> of doing loads for this. Did you review other ISAs (those I'm not >>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether >>>> they have similar support and whether your approach can be supported >>>> there? ISTR SVE must have some similar support - what's the reason >>>> you do not piggy-back on that? >>> >>> I may miss something, but I didn't find there is any support that meet with >>> this in GCC. Good suggestion on ISAs, I didn't review other ISAs, for the >>> current implementation, I heavily referred to existing SVE fully predicated >>> loop support, it's stronger than "with length", it can deal with arbitrary >>> elements, only perform for the loop fully etc. >>> >>>> >>>> I think a load like I described above might be represented as >>>> >>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3)); >>>> >>>> not sure if that actually works out though. But given it seems it >>>> is a contiguous load we shouldn't need an internal function here? >>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess >>>> on RTL you end up with a paradoxical subreg - or an UNSPEC] >>> >>> IIUC, what you suggested is to avoid use IFN here. May I know the reason? >>> is it due to the maintenance efforts on various IFNs? I thought using >>> IFN is gentle. And agreed, I had the concern that the size mismatch >>> problem here since the length can be large (extremely large probably, it >>> depends on target saturated limit), the converted vector size can be small. >>> Besides, the length can be a variable. >>> >>>> >>>> That said, I'm not very happy seeing yet another way of doing loads >>>> [for fully predicated loops]. I'd rather like to see a single >>>> representation on GIMPLE at least. >>> >>> OK. Does it mean IFN isn't counted into this scope? :) >> >> Sure going with a new IFN is a possibility. I'd just avoid that if >> possible ;) How does SVE represent loads/stores for whilelt loops? >> Would it not be possible to share that representation somehow? >> >> Richard. >> > > Got it. :) Currently SVE uses IFNs .MASK_LOAD and .MASK_STORE for > whileult loops: > > vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9); > ... > .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18); > ... > next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... }); > if (next_mask_26 != { 0, ... }) > > I thought to share it once, but didn't feel it's a good idea since it's > like we take something as masks that isn't actually masks, easily confuse > people. OTOH, Power or other potential targets probably supports these > kinds of masking instructions, easily to confuse people further. > > But I definitely agree that we can refactor some codes to share with > each other when possible. > > Btw, here the code with proposed IFNs looks like: > > vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5); > ... > .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18); > ivtmp_23 = ivtmp_22 + 16; > _25 = MIN_EXPR <ivtmp_23, 508>; > _26 = 508 - _25; > _27 = MIN_EXPR <_26, 16>; > if (ivtmp_23 <= 507) FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good approach. I think it'll be more maintainable in the long run than trying to have .MASK_LOADs and .MASK_STOREs that need a special mask operand. (That would be too similar to VEC_COND_EXPR :-)) Not sure yet what the exact semantics wrt out-of-range values for the IFN/optab though. Maybe we should instead have some kind of abstract, target-specific cookie created by a separate intrinsic. Haven't thought much about it yet... Thanks, Richard
On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote: > From a look at the series description below you seem to add a new way > of doing loads for this. Did you review other ISAs (those I'm not > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether > they have similar support and whether your approach can be supported > there? ISTR SVE must have some similar support - what's the reason > you do not piggy-back on that? There isn't any RISC-V Vector support in GCC yet. The RVV spec is still in draft and still occasionally changing in incompatible ways. We've done some experimenting with gcc patches, but all we have are intrinsics. We haven't implemented any auto vectorization support, so we haven't defined tree representations for anything yet, other than the types we need for intrinsics support. But if it looks OK for SVE then it probably will be OK for RVV. Jim
Hi! On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote: > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good > approach. I think it'll be more maintainable in the long run than > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask > operand. (That would be too similar to VEC_COND_EXPR :-)) > > Not sure yet what the exact semantics wrt out-of-range values for > the IFN/optab though. Maybe we should instead have some kind of > abstract, target-specific cookie created by a separate intrinsic. > Haven't thought much about it yet... Or maybe only support 0..N with N the length of the vector? It is pretty important to support 0 and N, but greater than N isn't as important (it is useful for tricky hand-written code, but not as much for compiler-generate code -- we only support an 8-bit number here on Power, maybe that is why ;-) ) Segher
On Tue, 26 May 2020, Jim Wilson wrote: > On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote: > > From a look at the series description below you seem to add a new way > > of doing loads for this. Did you review other ISAs (those I'm not > > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether > > they have similar support and whether your approach can be supported > > there? ISTR SVE must have some similar support - what's the reason > > you do not piggy-back on that? > > There isn't any RISC-V Vector support in GCC yet. The RVV spec is > still in draft and still occasionally changing in incompatible ways. > We've done some experimenting with gcc patches, but all we have are > intrinsics. We haven't implemented any auto vectorization support, so > we haven't defined tree representations for anything yet, other than > the types we need for intrinsics support. But if it looks OK for SVE > then it probably will be OK for RVV. Btw, I'm specifically looking for other load/store with length implementations and as to whether they agree on taking bytes for the length rather than, for example the number of lanes. I guess exposing this detail on GIMPLE can help IV selection but if we'd ever get a differing semantics ISA we'd have to add another set of IFNs, so maybe the PPC ones should be named in a more specific way like _WITH_BYTES or _BYTES or _WITH_BYTE_LENGTH or so to allow _WITH_LANES? Richard. > Jim >
On Tue, 26 May 2020, Segher Boessenkool wrote: > Hi! > > On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote: > > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good > > approach. I think it'll be more maintainable in the long run than > > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask > > operand. (That would be too similar to VEC_COND_EXPR :-)) > > > > Not sure yet what the exact semantics wrt out-of-range values for > > the IFN/optab though. Maybe we should instead have some kind of > > abstract, target-specific cookie created by a separate intrinsic. > > Haven't thought much about it yet... > > Or maybe only support 0..N with N the length of the vector? It is > pretty important to support 0 and N, but greater than N isn't as > important (it is useful for tricky hand-written code, but not as much > for compiler-generate code -- we only support an 8-bit number here on > Power, maybe that is why ;-) ) The question is one of semantics - if power masks the length to an 8 bit number it's important to preprocess the IV. As with my other suggestion the question is what to expose to the IL (to GIMPLE) here. Exposing as much as possible will help IV selection but will eventually require IFN variations for different semantics. So yes, 0..N sounds about right here and we'll require a MIN () operation and likely need to teach IV selection about this to at least possibly get an IV with the byte size multiplication factored. Richard. > > Segher >
Richard Biener <rguenther@suse.de> writes: > On Tue, 26 May 2020, Jim Wilson wrote: > >> On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote: >> > From a look at the series description below you seem to add a new way >> > of doing loads for this. Did you review other ISAs (those I'm not >> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether >> > they have similar support and whether your approach can be supported >> > there? ISTR SVE must have some similar support - what's the reason >> > you do not piggy-back on that? >> >> There isn't any RISC-V Vector support in GCC yet. The RVV spec is >> still in draft and still occasionally changing in incompatible ways. >> We've done some experimenting with gcc patches, but all we have are >> intrinsics. We haven't implemented any auto vectorization support, so >> we haven't defined tree representations for anything yet, other than >> the types we need for intrinsics support. But if it looks OK for SVE >> then it probably will be OK for RVV. > > Btw, I'm specifically looking for other load/store with length > implementations and as to whether they agree on taking bytes for > the length rather than, for example the number of lanes. I guess > exposing this detail on GIMPLE can help IV selection but if we'd > ever get a differing semantics ISA we'd have to add another set > of IFNs, so maybe the PPC ones should be named in a more specific > way like _WITH_BYTES or _BYTES or _WITH_BYTE_LENGTH or so to > allow _WITH_LANES? Maybe that detail is another thing that a cookie could hide. We'd then potentially need one IFN per approach to calculating the length parameter (bytes vs. elements, self-capping vs. explicit capping, etc.), but it would only be one IFN per approach, rather than the combinatorial explosion we'd get from one IFN per approach*load/store-kind. It doesn't make much difference when we only have one LOAD and one STORE per approach. But I imagine this will be useful for MVE, and there we'll want extending loads, truncating stores, gathers and scatters too. Thanks, Richard
on 2020/5/27 下午3:25, Richard Biener wrote: > On Tue, 26 May 2020, Segher Boessenkool wrote: > >> Hi! >> >> On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote: >>> FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good >>> approach. I think it'll be more maintainable in the long run than >>> trying to have .MASK_LOADs and .MASK_STOREs that need a special mask >>> operand. (That would be too similar to VEC_COND_EXPR :-)) >>> >>> Not sure yet what the exact semantics wrt out-of-range values for >>> the IFN/optab though. Maybe we should instead have some kind of >>> abstract, target-specific cookie created by a separate intrinsic. >>> Haven't thought much about it yet... >> >> Or maybe only support 0..N with N the length of the vector? It is >> pretty important to support 0 and N, but greater than N isn't as >> important (it is useful for tricky hand-written code, but not as much >> for compiler-generate code -- we only support an 8-bit number here on >> Power, maybe that is why ;-) ) > > The question is one of semantics - if power masks the length to an > 8 bit number it's important to preprocess the IV. As with my > other suggestion the question is what to expose to the IL (to GIMPLE) > here. Exposing as much as possible will help IV selection but > will eventually require IFN variations for different semantics. > In the current implementation, we don't use IFN for the length computation, it has something like: ivtmp_28 = ivtmp_27 + 16; _39 = MIN_EXPR <ivtmp_28, _32>; // _32 is the limit _40 = _32 - _39; // get the zero bytes for the ending _41 = MIN_EXPR <_40, 16>; // check for vector size if (ivtmp_28 < _32) In my initial thought, the len load/store IFNs are considered to accept any lengths (any values hold in length mode), since the length larger than vector size is no sense, the hardware can take it as saturated to vector size, if hardware has some masking bits on it like ppc, we can add one hook to guard the MIN requirement for length gen. For now, the MIN is mandatory since ppc is the only user. FWIW, if we mostly adopt this for epilogues or small loop (iteration < VF), the range can be analyzed during compilation time, these MIN computations can be optimized theoricially. > So yes, 0..N sounds about right here and we'll require a MIN () > operation and likely need to teach IV selection about this to at least > possibly get an IV with the byte size multiplication factored. > FWIW, in the current implementation, the step/limit have multiplied the bytes of lanes first, the IV computation will not have the multilcation for it there. BR, Kewen > Richard. > >> >> Segher >> >
On Wed, May 27, 2020 at 09:25:43AM +0200, Richard Biener wrote: > On Tue, 26 May 2020, Segher Boessenkool wrote: > > On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote: > > > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good > > > approach. I think it'll be more maintainable in the long run than > > > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask > > > operand. (That would be too similar to VEC_COND_EXPR :-)) > > > > > > Not sure yet what the exact semantics wrt out-of-range values for > > > the IFN/optab though. Maybe we should instead have some kind of > > > abstract, target-specific cookie created by a separate intrinsic. > > > Haven't thought much about it yet... > > > > Or maybe only support 0..N with N the length of the vector? It is > > pretty important to support 0 and N, but greater than N isn't as > > important (it is useful for tricky hand-written code, but not as much > > for compiler-generate code -- we only support an 8-bit number here on > > Power, maybe that is why ;-) ) > > The question is one of semantics - if power masks the length to an > 8 bit number it's important to preprocess the IV. In the instructions it *is* an 8 bit number (it is the top 8 bits of a GPR). > As with my > other suggestion the question is what to expose to the IL (to GIMPLE) > here. Yes, I understand that. Hence my answer :-) Only multiples of element size would be fine as well of course. > Exposing as much as possible will help IV selection but > will eventually require IFN variations for different semantics. > > So yes, 0..N sounds about right here and we'll require a MIN () > operation and likely need to teach IV selection about this to at least > possibly get an IV with the byte size multiplication factored. Maybe we should have a hook to say which lengths are allowed for which element type? And, how does this work for variable lengths (the usual case!) Segher