mbox series

[0/7] Support vector load/store with length

Message ID 30906c0d-3b9f-e1e6-156f-c01fcf229cb9@linux.ibm.com
Headers show
Series Support vector load/store with length | expand

Message

Kewen.Lin May 26, 2020, 5:49 a.m. UTC
Hi all,

This patch set adds support for vector load/store with length, Power 
ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
length, it's good to be exploited for those cases we don't have enough
stuffs to fill in the whole vector like epilogues.

This support mainly refers to the handlings for fully-predicated loop
but it also covers the epilogue usage.  Now it supports two modes
controlled by parameter vect-with-length-scope, it can support any
loops fully with length or just for those cases with small iteration
counts less than VF like epilogue, for now I don't have ready env to
benchmark it, but based on the current inefficient length generation,
I don't think it's a good idea to adopt vector with length for any loops.
For the main loop which used to be vectorized, it increases register
pressure and introduces extra computation for length, the pro for icache
seems not comparable.  But I think it might be a good idea to keep this
parameter there for functionality testing, further benchmarking and other
ports' potential future supports.

As we don't have any benchmarking, this support isn't enabled by default
for any particular cpus, all testings are with explicit parameter setting.

Bootstrapped on powerpc64le-linux-gnu P9 with all vect-with-length-scope
settings (0/1/2).  Regress-test passed with vector-with-length-scope 0,
for the other twos, several vector related cases need to be updated, no
remarkable failures found.  BTW, P9 is the one which supports the
functionality but not ready to evaluate the performance.

Here still are many things to be supported or improved, not limited to:
  - reduction/live-out support
  - Cost model adding/tweaking
  - IFN gimple folding
  - Some unnecessary ops improvements eg: vector_size check
  - Some possible refactoring
I'll support/post the patches gradually.

Any comments are highly appreciated.

BR,
Kewen
-----

Patch set outline:
  [PATCH 1/7] ifn/optabs: Support vector load/store with length
  [PATCH 2/7] rs6000: lenload/lenstore optab support
  [PATCH 3/7] vect: Factor out codes for niters smaller than vf check
  [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length
  [PATCH 5/7] vect: Support vector load/store with length in vectorizer
  [PATCH 6/7] ivopts: Add handlings for vector with length IFNs
  [PATCH 7/7] rs6000/testsuite: Vector with length test cases

 gcc/config/rs6000/rs6000.c                                  |   3 +
 gcc/config/rs6000/vsx.md                                    |  30 ++++++++++
 gcc/doc/invoke.texi                                         |   7 +++
 gcc/doc/md.texi                                             |  16 ++++++
 gcc/doc/tm.texi                                             |   6 ++
 gcc/doc/tm.texi.in                                          |   2 +
 gcc/internal-fn.c                                           |  13 ++++-
 gcc/internal-fn.def                                         |   6 ++
 gcc/optabs.def                                              |   2 +
 gcc/params.opt                                              |   4 ++
 gcc/target.def                                              |   7 +++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h          |  18 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h          |  17 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h          |  31 +++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h          |  24 ++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h          |  29 ++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h          |  32 +++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c     |  18 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c     |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c     |  17 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c     |  16 ++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c |  10 ++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h      |  34 ++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h      |  36 ++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h      |  34 ++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h      |  62 +++++++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h      |  45 +++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h      |  52 +++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-vec-length.h            |  14 +++++
 gcc/tree-ssa-loop-ivopts.c                                  |   4 ++
 gcc/tree-vect-loop-manip.c                                  | 268 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 gcc/tree-vect-loop.c                                        | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-vect-stmts.c                                       | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-vectorizer.h                                       |  32 +++++++++++
 53 files changed, 1545 insertions(+), 18 deletions(-)

Comments

Richard Biener May 26, 2020, 7:12 a.m. UTC | #1
On Tue, 26 May 2020, Kewen.Lin wrote:

> Hi all,
> 
> This patch set adds support for vector load/store with length, Power 
> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
> length, it's good to be exploited for those cases we don't have enough
> stuffs to fill in the whole vector like epilogues.
> 
> This support mainly refers to the handlings for fully-predicated loop
> but it also covers the epilogue usage.  Now it supports two modes
> controlled by parameter vect-with-length-scope, it can support any
> loops fully with length or just for those cases with small iteration
> counts less than VF like epilogue, for now I don't have ready env to
> benchmark it, but based on the current inefficient length generation,
> I don't think it's a good idea to adopt vector with length for any loops.
> For the main loop which used to be vectorized, it increases register
> pressure and introduces extra computation for length, the pro for icache
> seems not comparable.  But I think it might be a good idea to keep this
> parameter there for functionality testing, further benchmarking and other
> ports' potential future supports.

Can you explain in more detail what "vector load/store with length" does?
Is that a "simplified" masked load/store which instead of masking 
arbitrary elements (and need a mask computed in the first place), masks
elements > N (the length operand)?  Thus assuming a lane IV decrementing
to zero that IV would be the natural argument for the length operand?
If that's correct, what data are the remaining lanes filled with?

From a look at the series description below you seem to add a new way
of doing loads for this.  Did you review other ISAs (those I'm not
familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
they have similar support and whether your approach can be supported
there?  ISTR SVE must have some similar support - what's the reason
you do not piggy-back on that?

I think a load like I described above might be represented as

_1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));

not sure if that actually works out though.  But given it seems it
is a contiguous load we shouldn't need an internal function here?
[there's a possible size mismatch in the __VIEW_CONVERT above, I guess
on RTL you end up with a paradoxical subreg - or an UNSPEC]

That said, I'm not very happy seeing yet another way of doing loads
[for fully predicated loops].  I'd rather like to see a single
representation on GIMPLE at least.

Will dig into the patch once the actual workings of those load/store with
length is confirmed.

I don't spot tree-vect-slp.c being changed - maybe that's not necessary
for SLP operation, but please do not introduce new vectorizer features
without supporting SLP operation at this point.

Thanks,
Richard.

> As we don't have any benchmarking, this support isn't enabled by default
> for any particular cpus, all testings are with explicit parameter setting.
> 
> Bootstrapped on powerpc64le-linux-gnu P9 with all vect-with-length-scope
> settings (0/1/2).  Regress-test passed with vector-with-length-scope 0,
> for the other twos, several vector related cases need to be updated, no
> remarkable failures found.  BTW, P9 is the one which supports the
> functionality but not ready to evaluate the performance.
> 
> Here still are many things to be supported or improved, not limited to:
>   - reduction/live-out support
>   - Cost model adding/tweaking
>   - IFN gimple folding
>   - Some unnecessary ops improvements eg: vector_size check
>   - Some possible refactoring
> I'll support/post the patches gradually.
> 
> Any comments are highly appreciated.
> 
> BR,
> Kewen
> -----
> 
> Patch set outline:
>   [PATCH 1/7] ifn/optabs: Support vector load/store with length
>   [PATCH 2/7] rs6000: lenload/lenstore optab support
>   [PATCH 3/7] vect: Factor out codes for niters smaller than vf check
>   [PATCH 4/7] hook/rs6000: Add vectorize length mode for vector with length
>   [PATCH 5/7] vect: Support vector load/store with length in vectorizer
>   [PATCH 6/7] ivopts: Add handlings for vector with length IFNs
>   [PATCH 7/7] rs6000/testsuite: Vector with length test cases
> 
>  gcc/config/rs6000/rs6000.c                                  |   3 +
>  gcc/config/rs6000/vsx.md                                    |  30 ++++++++++
>  gcc/doc/invoke.texi                                         |   7 +++
>  gcc/doc/md.texi                                             |  16 ++++++
>  gcc/doc/tm.texi                                             |   6 ++
>  gcc/doc/tm.texi.in                                          |   2 +
>  gcc/internal-fn.c                                           |  13 ++++-
>  gcc/internal-fn.def                                         |   6 ++
>  gcc/optabs.def                                              |   2 +
>  gcc/params.opt                                              |   4 ++
>  gcc/target.def                                              |   7 +++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-1.h          |  18 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-2.h          |  17 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-3.h          |  31 +++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-4.h          |  24 ++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-5.h          |  29 ++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-6.h          |  32 +++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-1.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-2.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-3.c     |  18 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-4.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-5.c     |  15 +++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-6.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-1.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-2.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-3.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-4.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-5.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-epil-run-6.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-3.c     |  17 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-4.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-5.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c     |  16 ++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-1.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-2.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-3.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-4.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-5.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-run-6.c |  10 ++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-1.h      |  34 ++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-2.h      |  36 ++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-3.h      |  34 ++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-4.h      |  62 +++++++++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-5.h      |  45 +++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-run-6.h      |  52 +++++++++++++++++
>  gcc/testsuite/gcc.target/powerpc/p9-vec-length.h            |  14 +++++
>  gcc/tree-ssa-loop-ivopts.c                                  |   4 ++
>  gcc/tree-vect-loop-manip.c                                  | 268 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-vect-loop.c                                        | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  gcc/tree-vect-stmts.c                                       | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-vectorizer.h                                       |  32 +++++++++++
>  53 files changed, 1545 insertions(+), 18 deletions(-)
>
Kewen.Lin May 26, 2020, 8:51 a.m. UTC | #2
Hi Richi,

on 2020/5/26 下午3:12, Richard Biener wrote:
> On Tue, 26 May 2020, Kewen.Lin wrote:
> 
>> Hi all,
>>
>> This patch set adds support for vector load/store with length, Power 
>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>> length, it's good to be exploited for those cases we don't have enough
>> stuffs to fill in the whole vector like epilogues.
>>
>> This support mainly refers to the handlings for fully-predicated loop
>> but it also covers the epilogue usage.  Now it supports two modes
>> controlled by parameter vect-with-length-scope, it can support any
>> loops fully with length or just for those cases with small iteration
>> counts less than VF like epilogue, for now I don't have ready env to
>> benchmark it, but based on the current inefficient length generation,
>> I don't think it's a good idea to adopt vector with length for any loops.
>> For the main loop which used to be vectorized, it increases register
>> pressure and introduces extra computation for length, the pro for icache
>> seems not comparable.  But I think it might be a good idea to keep this
>> parameter there for functionality testing, further benchmarking and other
>> ports' potential future supports.
> 
> Can you explain in more detail what "vector load/store with length" does?
> Is that a "simplified" masked load/store which instead of masking 
> arbitrary elements (and need a mask computed in the first place), masks
> elements > N (the length operand)?  Thus assuming a lane IV decrementing
> to zero that IV would be the natural argument for the length operand?
> If that's correct, what data are the remaining lanes filled with?
> 

The vector load/store have one GPR holding one length in bytes (called as
n here) to control how many bytes we want to load/store.  If n > vector_size
(on Power it's 16), n will be taken as 16, if n is zero, the storage access
won't happen, the result for load is vector zero, if n > 0 but < vector_size,
the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
of the length GPR, so the length should be adjusted.

Your understanding is correct!  It's a "simplified" masked load/store, need
the length in bytes computed, only support continuous access.  For the lane
IV, the one should multiply with the lane bytes and be adjusted if needed.

> From a look at the series description below you seem to add a new way
> of doing loads for this.  Did you review other ISAs (those I'm not
> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> they have similar support and whether your approach can be supported
> there?  ISTR SVE must have some similar support - what's the reason
> you do not piggy-back on that?

I may miss something, but I didn't find there is any support that meet with
this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
current implementation, I heavily referred to existing SVE fully predicated
loop support, it's stronger than "with length", it can deal with arbitrary
elements, only perform for the loop fully etc.

> 
> I think a load like I described above might be represented as
> 
> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
> 
> not sure if that actually works out though.  But given it seems it
> is a contiguous load we shouldn't need an internal function here?
> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
> on RTL you end up with a paradoxical subreg - or an UNSPEC]

IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
is it due to the maintenance efforts on various IFNs?  I thought using
IFN is gentle.  And agreed, I had the concern that the size mismatch
problem here since the length can be large (extremely large probably, it
depends on target saturated limit), the converted vector size can be small.
Besides, the length can be a variable.
> 
> That said, I'm not very happy seeing yet another way of doing loads
> [for fully predicated loops].  I'd rather like to see a single
> representation on GIMPLE at least.

OK.  Does it mean IFN isn't counted into this scope?  :)

> 
> Will dig into the patch once the actual workings of those load/store with
> length is confirmed.

Thanks a lot for your time!

> 
> I don't spot tree-vect-slp.c being changed - maybe that's not necessary
> for SLP operation, but please do not introduce new vectorizer features
> without supporting SLP operation at this point.

Got it.  SLP is also one part to be supported, I forgot to state in the
original email.  I'll let it be for now.  :)

BR,
Kewen
Richard Biener May 26, 2020, 9:44 a.m. UTC | #3
On Tue, 26 May 2020, Kewen.Lin wrote:

> Hi Richi,
> 
> on 2020/5/26 下午3:12, Richard Biener wrote:
> > On Tue, 26 May 2020, Kewen.Lin wrote:
> > 
> >> Hi all,
> >>
> >> This patch set adds support for vector load/store with length, Power 
> >> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
> >> length, it's good to be exploited for those cases we don't have enough
> >> stuffs to fill in the whole vector like epilogues.
> >>
> >> This support mainly refers to the handlings for fully-predicated loop
> >> but it also covers the epilogue usage.  Now it supports two modes
> >> controlled by parameter vect-with-length-scope, it can support any
> >> loops fully with length or just for those cases with small iteration
> >> counts less than VF like epilogue, for now I don't have ready env to
> >> benchmark it, but based on the current inefficient length generation,
> >> I don't think it's a good idea to adopt vector with length for any loops.
> >> For the main loop which used to be vectorized, it increases register
> >> pressure and introduces extra computation for length, the pro for icache
> >> seems not comparable.  But I think it might be a good idea to keep this
> >> parameter there for functionality testing, further benchmarking and other
> >> ports' potential future supports.
> > 
> > Can you explain in more detail what "vector load/store with length" does?
> > Is that a "simplified" masked load/store which instead of masking 
> > arbitrary elements (and need a mask computed in the first place), masks
> > elements > N (the length operand)?  Thus assuming a lane IV decrementing
> > to zero that IV would be the natural argument for the length operand?
> > If that's correct, what data are the remaining lanes filled with?
> > 
> 
> The vector load/store have one GPR holding one length in bytes (called as
> n here) to control how many bytes we want to load/store.  If n > vector_size
> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
> of the length GPR, so the length should be adjusted.
> 
> Your understanding is correct!  It's a "simplified" masked load/store, need
> the length in bytes computed, only support continuous access.  For the lane
> IV, the one should multiply with the lane bytes and be adjusted if needed.
> 
> > From a look at the series description below you seem to add a new way
> > of doing loads for this.  Did you review other ISAs (those I'm not
> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> > they have similar support and whether your approach can be supported
> > there?  ISTR SVE must have some similar support - what's the reason
> > you do not piggy-back on that?
> 
> I may miss something, but I didn't find there is any support that meet with
> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
> current implementation, I heavily referred to existing SVE fully predicated
> loop support, it's stronger than "with length", it can deal with arbitrary
> elements, only perform for the loop fully etc.
> 
> > 
> > I think a load like I described above might be represented as
> > 
> > _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
> > 
> > not sure if that actually works out though.  But given it seems it
> > is a contiguous load we shouldn't need an internal function here?
> > [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
> > on RTL you end up with a paradoxical subreg - or an UNSPEC]
> 
> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
> is it due to the maintenance efforts on various IFNs?  I thought using
> IFN is gentle.  And agreed, I had the concern that the size mismatch
> problem here since the length can be large (extremely large probably, it
> depends on target saturated limit), the converted vector size can be small.
> Besides, the length can be a variable.
>
> > 
> > That said, I'm not very happy seeing yet another way of doing loads
> > [for fully predicated loops].  I'd rather like to see a single
> > representation on GIMPLE at least.
> 
> OK.  Does it mean IFN isn't counted into this scope?  :)

Sure going with a new IFN is a possibility.  I'd just avoid that if
possible ;)  How does SVE represent loads/stores for whilelt loops?
Would it not be possible to share that representation somehow?

Richard.

> > 
> > Will dig into the patch once the actual workings of those load/store with
> > length is confirmed.
> 
> Thanks a lot for your time!
> 
> > 
> > I don't spot tree-vect-slp.c being changed - maybe that's not necessary
> > for SLP operation, but please do not introduce new vectorizer features
> > without supporting SLP operation at this point.
> 
> Got it.  SLP is also one part to be supported, I forgot to state in the
> original email.  I'll let it be for now.  :)
> 
> BR,
> Kewen
>
Kewen.Lin May 26, 2020, 10:10 a.m. UTC | #4
on 2020/5/26 下午5:44, Richard Biener wrote:
> On Tue, 26 May 2020, Kewen.Lin wrote:
> 
>> Hi Richi,
>>
>> on 2020/5/26 下午3:12, Richard Biener wrote:
>>> On Tue, 26 May 2020, Kewen.Lin wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch set adds support for vector load/store with length, Power 
>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>>>> length, it's good to be exploited for those cases we don't have enough
>>>> stuffs to fill in the whole vector like epilogues.
>>>>
>>>> This support mainly refers to the handlings for fully-predicated loop
>>>> but it also covers the epilogue usage.  Now it supports two modes
>>>> controlled by parameter vect-with-length-scope, it can support any
>>>> loops fully with length or just for those cases with small iteration
>>>> counts less than VF like epilogue, for now I don't have ready env to
>>>> benchmark it, but based on the current inefficient length generation,
>>>> I don't think it's a good idea to adopt vector with length for any loops.
>>>> For the main loop which used to be vectorized, it increases register
>>>> pressure and introduces extra computation for length, the pro for icache
>>>> seems not comparable.  But I think it might be a good idea to keep this
>>>> parameter there for functionality testing, further benchmarking and other
>>>> ports' potential future supports.
>>>
>>> Can you explain in more detail what "vector load/store with length" does?
>>> Is that a "simplified" masked load/store which instead of masking 
>>> arbitrary elements (and need a mask computed in the first place), masks
>>> elements > N (the length operand)?  Thus assuming a lane IV decrementing
>>> to zero that IV would be the natural argument for the length operand?
>>> If that's correct, what data are the remaining lanes filled with?
>>>
>>
>> The vector load/store have one GPR holding one length in bytes (called as
>> n here) to control how many bytes we want to load/store.  If n > vector_size
>> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
>> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
>> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
>> of the length GPR, so the length should be adjusted.
>>
>> Your understanding is correct!  It's a "simplified" masked load/store, need
>> the length in bytes computed, only support continuous access.  For the lane
>> IV, the one should multiply with the lane bytes and be adjusted if needed.
>>
>>> From a look at the series description below you seem to add a new way
>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>>> they have similar support and whether your approach can be supported
>>> there?  ISTR SVE must have some similar support - what's the reason
>>> you do not piggy-back on that?
>>
>> I may miss something, but I didn't find there is any support that meet with
>> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
>> current implementation, I heavily referred to existing SVE fully predicated
>> loop support, it's stronger than "with length", it can deal with arbitrary
>> elements, only perform for the loop fully etc.
>>
>>>
>>> I think a load like I described above might be represented as
>>>
>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>
>>> not sure if that actually works out though.  But given it seems it
>>> is a contiguous load we shouldn't need an internal function here?
>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
>>> on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>
>> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
>> is it due to the maintenance efforts on various IFNs?  I thought using
>> IFN is gentle.  And agreed, I had the concern that the size mismatch
>> problem here since the length can be large (extremely large probably, it
>> depends on target saturated limit), the converted vector size can be small.
>> Besides, the length can be a variable.
>>
>>>
>>> That said, I'm not very happy seeing yet another way of doing loads
>>> [for fully predicated loops].  I'd rather like to see a single
>>> representation on GIMPLE at least.
>>
>> OK.  Does it mean IFN isn't counted into this scope?  :)
> 
> Sure going with a new IFN is a possibility.  I'd just avoid that if
> possible ;)  How does SVE represent loads/stores for whilelt loops?
> Would it not be possible to share that representation somehow?
> 
> Richard.
> 

Got it.  :) Currently SVE uses IFNs .MASK_LOAD and .MASK_STORE for 
whileult loops:

  vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
  ...
  .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
  ...
  next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
  if (next_mask_26 != { 0, ... })

I thought to share it once, but didn't feel it's a good idea since it's
like we take something as masks that isn't actually masks, easily confuse
people.  OTOH, Power or other potential targets probably supports these
kinds of masking instructions, easily to confuse people further.

But I definitely agree that we can refactor some codes to share with
each other when possible.

Btw, here the code with proposed IFNs looks like:

  vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
  ...
  .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
  ivtmp_23 = ivtmp_22 + 16;
  _25 = MIN_EXPR <ivtmp_23, 508>;
  _26 = 508 - _25;
  _27 = MIN_EXPR <_26, 16>;
  if (ivtmp_23 <= 507)


BR,
Kewen

>>>
>>> Will dig into the patch once the actual workings of those load/store with
>>> length is confirmed.
>>
>> Thanks a lot for your time!
>>
>>>
>>> I don't spot tree-vect-slp.c being changed - maybe that's not necessary
>>> for SLP operation, but please do not introduce new vectorizer features
>>> without supporting SLP operation at this point.
>>
>> Got it.  SLP is also one part to be supported, I forgot to state in the
>> original email.  I'll let it be for now.  :)
>>
>> BR,
>> Kewen
>>
>
Richard Sandiford May 26, 2020, 12:29 p.m. UTC | #5
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/5/26 下午5:44, Richard Biener wrote:
>> On Tue, 26 May 2020, Kewen.Lin wrote:
>> 
>>> Hi Richi,
>>>
>>> on 2020/5/26 下午3:12, Richard Biener wrote:
>>>> On Tue, 26 May 2020, Kewen.Lin wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch set adds support for vector load/store with length, Power 
>>>>> ISA 3.0 brings instructions lxvl/stxvl to perform vector load/store with
>>>>> length, it's good to be exploited for those cases we don't have enough
>>>>> stuffs to fill in the whole vector like epilogues.
>>>>>
>>>>> This support mainly refers to the handlings for fully-predicated loop
>>>>> but it also covers the epilogue usage.  Now it supports two modes
>>>>> controlled by parameter vect-with-length-scope, it can support any
>>>>> loops fully with length or just for those cases with small iteration
>>>>> counts less than VF like epilogue, for now I don't have ready env to
>>>>> benchmark it, but based on the current inefficient length generation,
>>>>> I don't think it's a good idea to adopt vector with length for any loops.
>>>>> For the main loop which used to be vectorized, it increases register
>>>>> pressure and introduces extra computation for length, the pro for icache
>>>>> seems not comparable.  But I think it might be a good idea to keep this
>>>>> parameter there for functionality testing, further benchmarking and other
>>>>> ports' potential future supports.
>>>>
>>>> Can you explain in more detail what "vector load/store with length" does?
>>>> Is that a "simplified" masked load/store which instead of masking 
>>>> arbitrary elements (and need a mask computed in the first place), masks
>>>> elements > N (the length operand)?  Thus assuming a lane IV decrementing
>>>> to zero that IV would be the natural argument for the length operand?
>>>> If that's correct, what data are the remaining lanes filled with?
>>>>
>>>
>>> The vector load/store have one GPR holding one length in bytes (called as
>>> n here) to control how many bytes we want to load/store.  If n > vector_size
>>> (on Power it's 16), n will be taken as 16, if n is zero, the storage access
>>> won't happen, the result for load is vector zero, if n > 0 but < vector_size,
>>> the remaining lanes will be filled with zero.  On Power, it checks 0:7 bits
>>> of the length GPR, so the length should be adjusted.
>>>
>>> Your understanding is correct!  It's a "simplified" masked load/store, need
>>> the length in bytes computed, only support continuous access.  For the lane
>>> IV, the one should multiply with the lane bytes and be adjusted if needed.
>>>
>>>> From a look at the series description below you seem to add a new way
>>>> of doing loads for this.  Did you review other ISAs (those I'm not
>>>> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>>>> they have similar support and whether your approach can be supported
>>>> there?  ISTR SVE must have some similar support - what's the reason
>>>> you do not piggy-back on that?
>>>
>>> I may miss something, but I didn't find there is any support that meet with
>>> this in GCC.  Good suggestion on ISAs, I didn't review other ISAs, for the
>>> current implementation, I heavily referred to existing SVE fully predicated
>>> loop support, it's stronger than "with length", it can deal with arbitrary
>>> elements, only perform for the loop fully etc.
>>>
>>>>
>>>> I think a load like I described above might be represented as
>>>>
>>>> _1 = __VIEW_CONVERT <v4df_t> (__MEM <double[n_2]> ((double *)p_3));
>>>>
>>>> not sure if that actually works out though.  But given it seems it
>>>> is a contiguous load we shouldn't need an internal function here?
>>>> [there's a possible size mismatch in the __VIEW_CONVERT above, I guess
>>>> on RTL you end up with a paradoxical subreg - or an UNSPEC]
>>>
>>> IIUC, what you suggested is to avoid use IFN here.  May I know the reason?
>>> is it due to the maintenance efforts on various IFNs?  I thought using
>>> IFN is gentle.  And agreed, I had the concern that the size mismatch
>>> problem here since the length can be large (extremely large probably, it
>>> depends on target saturated limit), the converted vector size can be small.
>>> Besides, the length can be a variable.
>>>
>>>>
>>>> That said, I'm not very happy seeing yet another way of doing loads
>>>> [for fully predicated loops].  I'd rather like to see a single
>>>> representation on GIMPLE at least.
>>>
>>> OK.  Does it mean IFN isn't counted into this scope?  :)
>> 
>> Sure going with a new IFN is a possibility.  I'd just avoid that if
>> possible ;)  How does SVE represent loads/stores for whilelt loops?
>> Would it not be possible to share that representation somehow?
>> 
>> Richard.
>> 
>
> Got it.  :) Currently SVE uses IFNs .MASK_LOAD and .MASK_STORE for 
> whileult loops:
>
>   vect__1.5_14 = .MASK_LOAD (_11, 4B, loop_mask_9);
>   ...
>   .MASK_STORE (_1, 4B, loop_mask_9, vect__3.9_18);
>   ...
>   next_mask_26 = .WHILE_ULT (_2, 127, { 0, ... });
>   if (next_mask_26 != { 0, ... })
>
> I thought to share it once, but didn't feel it's a good idea since it's
> like we take something as masks that isn't actually masks, easily confuse
> people.  OTOH, Power or other potential targets probably supports these
> kinds of masking instructions, easily to confuse people further.
>
> But I definitely agree that we can refactor some codes to share with
> each other when possible.
>
> Btw, here the code with proposed IFNs looks like:
>
>   vect__1.5_4 = .LEN_LOAD (_15, 4B, loop_len_5);
>   ...
>   .LEN_STORE (_1, 4B, loop_len_5, vect__3.9_18);
>   ivtmp_23 = ivtmp_22 + 16;
>   _25 = MIN_EXPR <ivtmp_23, 508>;
>   _26 = 508 - _25;
>   _27 = MIN_EXPR <_26, 16>;
>   if (ivtmp_23 <= 507)

FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
approach.  I think it'll be more maintainable in the long run than
trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
operand.  (That would be too similar to VEC_COND_EXPR :-))

Not sure yet what the exact semantics wrt out-of-range values for
the IFN/optab though.  Maybe we should instead have some kind of
abstract, target-specific cookie created by a separate intrinsic.
Haven't thought much about it yet...

Thanks,
Richard
Jim Wilson May 26, 2020, 10:34 p.m. UTC | #6
On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote:
> From a look at the series description below you seem to add a new way
> of doing loads for this.  Did you review other ISAs (those I'm not
> familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> they have similar support and whether your approach can be supported
> there?  ISTR SVE must have some similar support - what's the reason
> you do not piggy-back on that?

There isn't any RISC-V Vector support in GCC yet.  The RVV spec is
still in draft and still occasionally changing in incompatible ways.
We've done some experimenting with gcc patches, but all we have are
intrinsics.  We haven't implemented any auto vectorization support, so
we haven't defined tree representations for anything yet, other than
the types we need for intrinsics support.  But if it looks OK for SVE
then it probably will be OK for RVV.

Jim
Segher Boessenkool May 27, 2020, 12:09 a.m. UTC | #7
Hi!

On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
> FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
> approach.  I think it'll be more maintainable in the long run than
> trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
> operand.  (That would be too similar to VEC_COND_EXPR :-))
> 
> Not sure yet what the exact semantics wrt out-of-range values for
> the IFN/optab though.  Maybe we should instead have some kind of
> abstract, target-specific cookie created by a separate intrinsic.
> Haven't thought much about it yet...

Or maybe only support 0..N with N the length of the vector?  It is
pretty important to support 0 and N, but greater than N isn't as
important (it is useful for tricky hand-written code, but not as much
for compiler-generate code -- we only support an 8-bit number here on
Power, maybe that is why ;-) )


Segher
Richard Biener May 27, 2020, 7:21 a.m. UTC | #8
On Tue, 26 May 2020, Jim Wilson wrote:

> On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote:
> > From a look at the series description below you seem to add a new way
> > of doing loads for this.  Did you review other ISAs (those I'm not
> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
> > they have similar support and whether your approach can be supported
> > there?  ISTR SVE must have some similar support - what's the reason
> > you do not piggy-back on that?
> 
> There isn't any RISC-V Vector support in GCC yet.  The RVV spec is
> still in draft and still occasionally changing in incompatible ways.
> We've done some experimenting with gcc patches, but all we have are
> intrinsics.  We haven't implemented any auto vectorization support, so
> we haven't defined tree representations for anything yet, other than
> the types we need for intrinsics support.  But if it looks OK for SVE
> then it probably will be OK for RVV.

Btw, I'm specifically looking for other load/store with length
implementations and as to whether they agree on taking bytes for
the length rather than, for example the number of lanes.  I guess
exposing this detail on GIMPLE can help IV selection but if we'd
ever get a differing semantics ISA we'd have to add another set
of IFNs, so maybe the PPC ones should be named in a more specific
way like _WITH_BYTES or _BYTES or _WITH_BYTE_LENGTH or so to
allow _WITH_LANES?

Richard.

> Jim
>
Richard Biener May 27, 2020, 7:25 a.m. UTC | #9
On Tue, 26 May 2020, Segher Boessenkool wrote:

> Hi!
> 
> On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
> > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
> > approach.  I think it'll be more maintainable in the long run than
> > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
> > operand.  (That would be too similar to VEC_COND_EXPR :-))
> > 
> > Not sure yet what the exact semantics wrt out-of-range values for
> > the IFN/optab though.  Maybe we should instead have some kind of
> > abstract, target-specific cookie created by a separate intrinsic.
> > Haven't thought much about it yet...
> 
> Or maybe only support 0..N with N the length of the vector?  It is
> pretty important to support 0 and N, but greater than N isn't as
> important (it is useful for tricky hand-written code, but not as much
> for compiler-generate code -- we only support an 8-bit number here on
> Power, maybe that is why ;-) )

The question is one of semantics - if power masks the length to an
8 bit number it's important to preprocess the IV.  As with my
other suggestion the question is what to expose to the IL (to GIMPLE)
here.  Exposing as much as possible will help IV selection but
will eventually require IFN variations for different semantics.

So yes, 0..N sounds about right here and we'll require a MIN ()
operation and likely need to teach IV selection about this to at least
possibly get an IV with the byte size multiplication factored.

Richard.

> 
> Segher
>
Richard Sandiford May 27, 2020, 7:46 a.m. UTC | #10
Richard Biener <rguenther@suse.de> writes:
> On Tue, 26 May 2020, Jim Wilson wrote:
>
>> On Tue, May 26, 2020 at 12:12 AM Richard Biener <rguenther@suse.de> wrote:
>> > From a look at the series description below you seem to add a new way
>> > of doing loads for this.  Did you review other ISAs (those I'm not
>> > familiar with myself too much are SVE, RISC-V and GCN) in GCC whether
>> > they have similar support and whether your approach can be supported
>> > there?  ISTR SVE must have some similar support - what's the reason
>> > you do not piggy-back on that?
>> 
>> There isn't any RISC-V Vector support in GCC yet.  The RVV spec is
>> still in draft and still occasionally changing in incompatible ways.
>> We've done some experimenting with gcc patches, but all we have are
>> intrinsics.  We haven't implemented any auto vectorization support, so
>> we haven't defined tree representations for anything yet, other than
>> the types we need for intrinsics support.  But if it looks OK for SVE
>> then it probably will be OK for RVV.
>
> Btw, I'm specifically looking for other load/store with length
> implementations and as to whether they agree on taking bytes for
> the length rather than, for example the number of lanes.  I guess
> exposing this detail on GIMPLE can help IV selection but if we'd
> ever get a differing semantics ISA we'd have to add another set
> of IFNs, so maybe the PPC ones should be named in a more specific
> way like _WITH_BYTES or _BYTES or _WITH_BYTE_LENGTH or so to
> allow _WITH_LANES?

Maybe that detail is another thing that a cookie could hide.  We'd then
potentially need one IFN per approach to calculating the length parameter
(bytes vs. elements, self-capping vs. explicit capping, etc.), but it would
only be one IFN per approach, rather than the combinatorial explosion
we'd get from one IFN per approach*load/store-kind.

It doesn't make much difference when we only have one LOAD and one STORE
per approach.  But I imagine this will be useful for MVE, and there we'll
want extending loads, truncating stores, gathers and scatters too.

Thanks,
Richard
Kewen.Lin May 27, 2020, 8:50 a.m. UTC | #11
on 2020/5/27 下午3:25, Richard Biener wrote:
> On Tue, 26 May 2020, Segher Boessenkool wrote:
> 
>> Hi!
>>
>> On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
>>> FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
>>> approach.  I think it'll be more maintainable in the long run than
>>> trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
>>> operand.  (That would be too similar to VEC_COND_EXPR :-))
>>>
>>> Not sure yet what the exact semantics wrt out-of-range values for
>>> the IFN/optab though.  Maybe we should instead have some kind of
>>> abstract, target-specific cookie created by a separate intrinsic.
>>> Haven't thought much about it yet...
>>
>> Or maybe only support 0..N with N the length of the vector?  It is
>> pretty important to support 0 and N, but greater than N isn't as
>> important (it is useful for tricky hand-written code, but not as much
>> for compiler-generate code -- we only support an 8-bit number here on
>> Power, maybe that is why ;-) )
> 
> The question is one of semantics - if power masks the length to an
> 8 bit number it's important to preprocess the IV.  As with my
> other suggestion the question is what to expose to the IL (to GIMPLE)
> here.  Exposing as much as possible will help IV selection but
> will eventually require IFN variations for different semantics.
> 

In the current implementation, we don't use IFN for the length computation,
it has something like:

  ivtmp_28 = ivtmp_27 + 16;
  _39 = MIN_EXPR <ivtmp_28, _32>;  // _32 is the limit
  _40 = _32 - _39;                 // get the zero bytes for the ending
  _41 = MIN_EXPR <_40, 16>;        // check for vector size
  if (ivtmp_28 < _32)

In my initial thought, the len load/store IFNs are considered to accept any
lengths (any values hold in length mode), since the length larger than vector
size is no sense, the hardware can take it as saturated to vector size, if
hardware has some masking bits on it like ppc, we can add one hook to guard
the MIN requirement for length gen.  For now, the MIN is mandatory since ppc
is the only user.

FWIW, if we mostly adopt this for epilogues or small loop (iteration < VF),
the range can be analyzed during compilation time, these MIN computations
can be optimized theoricially.

> So yes, 0..N sounds about right here and we'll require a MIN ()
> operation and likely need to teach IV selection about this to at least
> possibly get an IV with the byte size multiplication factored.
> 

FWIW, in the current implementation, the step/limit have multiplied the bytes
of lanes first, the IV computation will not have the multilcation for it there.

BR,
Kewen

> Richard.
> 
>>
>> Segher
>>
>
Segher Boessenkool May 27, 2020, 2:08 p.m. UTC | #12
On Wed, May 27, 2020 at 09:25:43AM +0200, Richard Biener wrote:
> On Tue, 26 May 2020, Segher Boessenkool wrote:
> > On Tue, May 26, 2020 at 01:29:30PM +0100, Richard Sandiford wrote:
> > > FWIW, I agree adding .LEN_LOAD and .LEN_STORE seems like a good
> > > approach.  I think it'll be more maintainable in the long run than
> > > trying to have .MASK_LOADs and .MASK_STOREs that need a special mask
> > > operand.  (That would be too similar to VEC_COND_EXPR :-))
> > > 
> > > Not sure yet what the exact semantics wrt out-of-range values for
> > > the IFN/optab though.  Maybe we should instead have some kind of
> > > abstract, target-specific cookie created by a separate intrinsic.
> > > Haven't thought much about it yet...
> > 
> > Or maybe only support 0..N with N the length of the vector?  It is
> > pretty important to support 0 and N, but greater than N isn't as
> > important (it is useful for tricky hand-written code, but not as much
> > for compiler-generate code -- we only support an 8-bit number here on
> > Power, maybe that is why ;-) )
> 
> The question is one of semantics - if power masks the length to an
> 8 bit number it's important to preprocess the IV.

In the instructions it *is* an 8 bit number (it is the top 8 bits of a
GPR).

> As with my
> other suggestion the question is what to expose to the IL (to GIMPLE)
> here.

Yes, I understand that.  Hence my answer :-)

Only multiples of element size would be fine as well of course.

> Exposing as much as possible will help IV selection but
> will eventually require IFN variations for different semantics.
> 
> So yes, 0..N sounds about right here and we'll require a MIN ()
> operation and likely need to teach IV selection about this to at least
> possibly get an IV with the byte size multiplication factored.

Maybe we should have a hook to say which lengths are allowed for which
element type?

And, how does this work for variable lengths (the usual case!)


Segher