Message ID | Pine.LNX.4.64.1204101538320.25409@wotan.suse.de |
---|---|
State | New |
Headers | show |
On Tue, Apr 10, 2012 at 3:46 PM, Michael Matz <matz@suse.de> wrote: > Hi, > > and this implements generally strided loads where the stride is a > loop-invariant (constant or ssa-name). We only do so if the load can't be > handled by interleaving groups. The implementation is fairly straight > forward: > > for (i = 0; i < n; i += stride) > ... = array[i]; > > is transformed into: > > for (j = 0; ; j += VF*stride) > tmp1 = array[j]; > tmp2 = array[j + stride]; > ... > vectemp = {tmp1, tmp2, ...} > > (of course variously adjusted for component number) > > The nice thing is that by such separate loads we don't need to care for > alignment (if the old access was aligned, so will be the new as the access > size doesn't change). > > This is one more step in vectorizing the same loops as ICC. polyhedron > has such a case in rnflow, where it helps quite much: > > Without patch: > > Benchmark Compile Executable Ave Run Number Estim > Name (secs) (bytes) (secs) Repeats Err % > --------- ------- ---------- ------- ------- ------ > ac 3.71 3960337 10.06 2 0.0368 > aermod 75.30 5594185 19.30 5 0.7102 > air 10.30 4832222 6.59 2 0.0653 > capacita 7.23 4185227 57.25 2 0.0968 > channel 2.05 4658532 2.70 5 4.4701 > doduc 14.55 4237850 23.31 2 0.0768 > fatigue 6.64 4127554 7.48 5 0.3063 > gas_dyn 13.25 4113557 2.99 5 4.5063 > induct 7.23 4338356 8.85 5 0.9927 > linpk 1.24 3927350 10.37 5 5.1174 > mdbx 5.51 4053841 12.95 5 0.0956 > nf 10.69 4062276 11.44 5 0.2727 > protein 33.04 5011924 39.53 5 0.2328 > rnflow 25.35 4238651 31.78 2 0.0673 > test_fpu 12.24 4184939 9.00 5 0.1157 > tfft 1.15 3976710 3.95 3 0.0736 > > Geometric Mean Execution Time = 11.21 seconds > > With patch: > > Benchmark Compile Executable Ave Run Number Estim > Name (secs) (bytes) (secs) Repeats Err % > --------- ------- ---------- ------- ------- ------ > ac 3.91 3960337 10.34 5 0.3661 > aermod 76.44 5598281 19.03 5 1.3394 > air 11.52 4832222 6.75 5 0.9347 > capacita 7.78 4189323 56.33 2 0.0976 > channel 2.14 
4658532 2.59 5 0.6640 > doduc 14.61 4237850 23.41 5 0.2974 > fatigue 6.62 4127554 7.44 5 0.1082 > gas_dyn 13.14 4113557 2.82 5 0.5253 > induct 7.26 4338356 8.49 2 0.0082 > linpk 1.23 3927350 9.78 2 0.0705 > mdbx 5.47 4053841 12.90 2 0.0601 > nf 10.67 4062276 11.33 2 0.0004 > protein 32.81 5011924 39.48 2 0.0893 > rnflow 26.57 4246843 26.70 5 0.0915 > test_fpu 13.29 4193131 8.82 5 0.2136 > tfft 1.14 3976710 3.95 5 0.1753 > > Geometric Mean Execution Time = 10.95 seconds > > I.e. for rnflow from 31.78 to 26.70 seconds, nearly 20% better. > > Regstrapped together with [1/2] on x86_64-linux, all langs+Ada, no > regressions. Okay for trunk? Ok with ... > > Ciao, > Michael. > > PR tree-optimization/18437 > > * tree-vectorizer.h (_stmt_vec_info.stride_load_p): New member. > (STMT_VINFO_STRIDE_LOAD_P): New accessor. > (vect_check_strided_load): Declare. > * tree-vect-data-refs.c (vect_check_strided_load): New function. > (vect_analyze_data_refs): Use it to accept strided loads. > * tree-vect-stmts.c (vectorizable_load): Ditto and handle them. > > testsuite/ > * gfortran.dg/vect/rnflow-trs2a2.f90: New test. > > Index: tree-vectorizer.h > =================================================================== > --- tree-vectorizer.h.orig 2012-04-10 13:19:18.000000000 +0200 > +++ tree-vectorizer.h 2012-04-10 13:21:19.000000000 +0200 > @@ -545,6 +545,7 @@ typedef struct _stmt_vec_info { > > /* For loads only, true if this is a gather load. */ > bool gather_p; > + bool stride_load_p; > } *stmt_vec_info; > > /* Access Functions. 
*/ > @@ -559,6 +560,7 @@ typedef struct _stmt_vec_info { > #define STMT_VINFO_VECTORIZABLE(S) (S)->vectorizable > #define STMT_VINFO_DATA_REF(S) (S)->data_ref_info > #define STMT_VINFO_GATHER_P(S) (S)->gather_p > +#define STMT_VINFO_STRIDE_LOAD_P(S) (S)->stride_load_p > > #define STMT_VINFO_DR_BASE_ADDRESS(S) (S)->dr_base_address > #define STMT_VINFO_DR_INIT(S) (S)->dr_init > @@ -875,6 +877,7 @@ extern bool vect_analyze_data_ref_access > extern bool vect_prune_runtime_alias_test_list (loop_vec_info); > extern tree vect_check_gather (gimple, loop_vec_info, tree *, tree *, > int *); > +extern bool vect_check_strided_load (gimple, loop_vec_info, tree *, tree *); > extern bool vect_analyze_data_refs (loop_vec_info, bb_vec_info, int *); > extern tree vect_create_data_ref_ptr (gimple, tree, struct loop *, tree, > tree *, gimple_stmt_iterator *, > Index: tree-vect-data-refs.c > =================================================================== > --- tree-vect-data-refs.c.orig 2012-04-10 13:18:35.000000000 +0200 > +++ tree-vect-data-refs.c 2012-04-10 13:21:19.000000000 +0200 > @@ -2690,6 +2690,53 @@ vect_check_gather (gimple stmt, loop_vec > return decl; > } > > +/* Check whether a non-affine load in STMT (being in the loop referred to > + in LOOP_VINFO) is suitable for handling as strided load. That is the case > + if its address is a simple induction variable. If so return the base > + of that induction variable in *BASEP and the (loop-invariant) step > + in *STEPP, both only when that pointer is non-zero. > + > + This handles ARRAY_REFs (with variant index) and MEM_REFs (with variant > + base pointer) only. 
*/ > + > +bool > +vect_check_strided_load (gimple stmt, loop_vec_info loop_vinfo, tree *basep, > + tree *stepp) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); > + tree base, off; > + affine_iv iv; > + > + base = DR_REF (dr); > + > + if (TREE_CODE (base) == ARRAY_REF) > + { > + off = TREE_OPERAND (base, 1); > + base = TREE_OPERAND (base, 0); > + } > + else if (TREE_CODE (base) == MEM_REF) > + { > + off = TREE_OPERAND (base, 0); > + base = TREE_OPERAND (base, 1); > + } > + else > + return false; > + > + if (TREE_CODE (off) != SSA_NAME) > + return false; > + > + if (!expr_invariant_in_loop_p (loop, base) > + || !simple_iv (loop, loop_containing_stmt (stmt), off, &iv, true)) > + return false; > + > + if (basep) > + *basep = iv.base; > + if (stepp) > + *stepp = iv.step; > + return true; > +} > > /* Function vect_analyze_data_refs. > > @@ -3090,16 +3137,21 @@ vect_analyze_data_refs (loop_vec_info lo > VEC (ddr_p, heap) *ddrs = LOOP_VINFO_DDRS (loop_vinfo); > struct data_dependence_relation *ddr, *newddr; > bool bad = false; > + bool strided_load = false; > tree off; > VEC (loop_p, heap) *nest = LOOP_VINFO_LOOP_NEST (loop_vinfo); > > - if (!vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL) > - || get_vectype_for_scalar_type (TREE_TYPE (off)) == NULL_TREE) > + strided_load = vect_check_strided_load (stmt, loop_vinfo, NULL, NULL); > + gather = 0 != vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL); > + if (gather > + && get_vectype_for_scalar_type (TREE_TYPE (off)) == NULL_TREE) > + gather = false; > + if (!gather && !strided_load) > { > if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) > { > fprintf (vect_dump, > - "not vectorized: not suitable for gather "); > + "not vectorized: not suitable for gather/strided load "); > print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM); > } > return false; > @@ -3152,13 +3204,16 @@ 
vect_analyze_data_refs (loop_vec_info lo > { > fprintf (vect_dump, > "not vectorized: data dependence conflict" > - " prevents gather"); > + " prevents gather/strided load"); > print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM); > } > return false; > } > > - STMT_VINFO_GATHER_P (stmt_info) = true; > + if (gather) > + STMT_VINFO_GATHER_P (stmt_info) = true; > + else if (strided_load) > + STMT_VINFO_STRIDE_LOAD_P (stmt_info) = true; > } > } > > Index: tree-vect-stmts.c > =================================================================== > --- tree-vect-stmts.c.orig 2012-04-10 13:18:35.000000000 +0200 > +++ tree-vect-stmts.c 2012-04-10 13:21:19.000000000 +0200 > @@ -4224,6 +4224,7 @@ vectorizable_load (gimple stmt, gimple_s > tree aggr_type; > tree gather_base = NULL_TREE, gather_off = NULL_TREE; > tree gather_off_vectype = NULL_TREE, gather_decl = NULL_TREE; > + tree stride_base, stride_step; > int gather_scale = 1; > enum vect_def_type gather_dt = vect_unknown_def_type; > > @@ -4357,6 +4358,10 @@ vectorizable_load (gimple stmt, gimple_s > return false; > } > } > + else if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > + { > + vect_check_strided_load (stmt, loop_vinfo, &stride_base, &stride_step); > + } > > if (!vec_stmt) /* transformation not required. */ > { > @@ -4520,6 +4525,104 @@ vectorizable_load (gimple stmt, gimple_s > STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; > else > STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt; > + prev_stmt_info = vinfo_for_stmt (new_stmt); > + } > + return true; > + } > + else if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > + { > + gimple_stmt_iterator incr_gsi; > + bool insert_after; > + gimple incr; > + tree offvar; > + tree ref = DR_REF (dr); > + tree ivstep; > + tree running_off; > + VEC(constructor_elt, gc) *v = NULL; > + gimple_seq stmts = NULL; > + > + gcc_assert (stride_base && stride_step); > + > + /* For a load with loop-invariant (but other than power-of-2) > + stride (i.e. 
not a grouped access) like so: > + > + for (i = 0; i < n; i += stride) > + ... = array[i]; > + > + we generate a new induction variable and new accesses to > + form a new vector (or vectors, depending on ncopies): > + > + for (j = 0; ; j += VF*stride) > + tmp1 = array[j]; > + tmp2 = array[j + stride]; > + ... > + vectemp = {tmp1, tmp2, ...} > + */ > + > + ivstep = stride_step; > + ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (ivstep), ivstep, > + build_int_cst_type (TREE_TYPE (ivstep), vf)); Use build_int_cst. > + > + standard_iv_increment_position (loop, &incr_gsi, &insert_after); > + > + create_iv (stride_base, ivstep, NULL, > + loop, &incr_gsi, insert_after, > + &offvar, NULL); > + incr = gsi_stmt (incr_gsi); > + set_vinfo_for_stmt (incr, new_stmt_vec_info (incr, loop_vinfo, NULL)); > + > + stride_step = force_gimple_operand (stride_step, &stmts, true, NULL_TREE); > + if (stmts) > + gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts); > + > + prev_stmt_info = NULL; > + running_off = offvar; > + for (j = 0; j < ncopies; j++) > + { > + tree vec_inv; > + > + v = VEC_alloc (constructor_elt, gc, nunits); > + for (i = 0; i < nunits; i++) > + { > + tree newref, newoff; > + gimple incr; > + if (TREE_CODE (ref) == ARRAY_REF) > + newref = build4 (ARRAY_REF, TREE_TYPE (ref), > + unshare_expr (TREE_OPERAND (ref, 0)), > + running_off, > + NULL_TREE, NULL_TREE); > + else > + newref = build2 (MEM_REF, TREE_TYPE (ref), > + running_off, > + TREE_OPERAND (ref, 1)); > + > + newref = force_gimple_operand_gsi (gsi, newref, true, > + NULL_TREE, true, > + GSI_SAME_STMT); > + CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, newref); > + newoff = SSA_NAME_VAR (running_off); > + if (POINTER_TYPE_P (TREE_TYPE (newoff))) > + incr = gimple_build_assign_with_ops (POINTER_PLUS_EXPR, newoff, > + running_off, stride_step); > + else > + incr = gimple_build_assign_with_ops (PLUS_EXPR, newoff, > + running_off, stride_step); > + newoff = make_ssa_name (newoff, incr); > + gimple_assign_set_lhs 
(incr, newoff); > + vect_finish_stmt_generation (stmt, incr, gsi); > + > + running_off = newoff; > + } > + > + vec_inv = build_constructor (vectype, v); > + new_temp = vect_init_vector (stmt, vec_inv, vectype, gsi); > + new_stmt = SSA_NAME_DEF_STMT (new_temp); > + mark_symbols_for_renaming (new_stmt); This should not be necessary - in fact please manually set the proper VUSE here. Thanks, Richard. > + > + if (j == 0) > + STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; > + else > + STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt; > prev_stmt_info = vinfo_for_stmt (new_stmt); > } > return true; > Index: testsuite/gfortran.dg/vect/rnflow-trs2a2.f90 > =================================================================== > --- testsuite/gfortran.dg/vect/rnflow-trs2a2.f90 (revision 0) > +++ testsuite/gfortran.dg/vect/rnflow-trs2a2.f90 (revision 0) > @@ -0,0 +1,33 @@ > +! { dg-do compile } > +! { dg-require-effective-target vect_double } > + > + function trs2a2 (j, k, u, d, m) > +! matrice de transition intermediaire, partant de k sans descendre > +! sous j. R = IjU(I-Ik)DIj, avec Ii = deltajj, j >= i. > +! alternative: trs2a2 = 0 > +! trs2a2 (j:k-1, j:k-1) = matmul (utrsft (j:k-1,j:k-1), > +! dtrsft (j:k-1,j:k-1)) > +! > + real, dimension (1:m,1:m) :: trs2a2 ! resultat > + real, dimension (1:m,1:m) :: u, d ! matrices utrsft, dtrsft > + integer, intent (in) :: j, k, m ! niveaux vallee pic > +! > +!##### following line replaced by Prentice to make less system dependent > +! real (kind = kind (1.0d0)) :: dtmp > + real (kind = selected_real_kind (10,50)) :: dtmp > +! > + trs2a2 = 0.0 > + do iclw1 = j, k - 1 > + do iclw2 = j, k - 1 > + dtmp = 0.0d0 > + do iclww = j, k - 1 > + dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2) > + enddo > + trs2a2 (iclw1, iclw2) = dtmp > + enddo > + enddo > + return > + end function trs2a2 > + > +! { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } > +! { dg-final { cleanup-tree-dump "vect" } }
Hi, On Tue, 10 Apr 2012, Richard Guenther wrote: > > + vec_inv = build_constructor (vectype, v); > > + new_temp = vect_init_vector (stmt, vec_inv, vectype, gsi); > > + new_stmt = SSA_NAME_DEF_STMT (new_temp); > > + mark_symbols_for_renaming (new_stmt); > > This should not be necessary - in fact please manually set the proper > VUSE here. I don't see how I could know the right VUSE. It might not even exist yet. Reusing the existing vuse from the load that is replaced isn't correct. Consider that there are (unrelated) stores or calls before the load, and those are vectorized. When updating their operands they'll get a new vdef, and I'd need to use _that_ one: original: #vdef MEM_1 scalar_store = .... (this insn will be removed by vectorizable_store) #vuse MEM_1 ... = A[i] ---> #vdef MEM (note, no ssaname yet) vect_store = ... #vuse MEM_1 ... = A[i] ( will be later removed as unused LHS) #vuse XXX ... = A[i] #vuse XXX ... = A[i+stride] So, which SSA name to use for XXX, when the vect_store doesn't yet know (which it doesn't). Tracking (and generating) proper VDEF SSA names is possible (we only vectorize a single BB, and that in stmt order), but that requires some rework of the vectorizer main loops, which should be a separate patch. Agreed? Ciao, Michael.
On Tue, Apr 10, 2012 at 6:05 PM, Michael Matz <matz@suse.de> wrote: > Hi, > > On Tue, 10 Apr 2012, Richard Guenther wrote: > >> > + vec_inv = build_constructor (vectype, v); >> > + new_temp = vect_init_vector (stmt, vec_inv, vectype, gsi); >> > + new_stmt = SSA_NAME_DEF_STMT (new_temp); >> > + mark_symbols_for_renaming (new_stmt); >> >> This should not be necessary - in fact please manually set the proper >> VUSE here. > > I don't see how I could know the right VUSE. It might not even exist yet. > Reusing the existing vuse from the load that is replaced isn't correct. > Consider that there are (unrelated) stores or calls before the load, and > those are vectorized. When updating their operands they'll get a new > vdef, and I'd need to use _that_ one: > > original: > #vdef MEM_1 > scalar_store = .... (this insn will be removed by vectorizable_store) > #vuse MEM_1 > ... = A[i] > > ---> > > #vdef MEM (note, no ssaname yet) > vect_store = ... > #vuse MEM_1 > ... = A[i] ( will be later removed as unused LHS) > #vuse XXX > ... = A[i] > #vuse XXX > ... = A[i+stride] > > So, which SSA name to use for XXX, when the vect_store doesn't yet know > (which it doesn't). > > Tracking (and generating) proper VDEF SSA names is possible (we only > vectorize a single BB, and that in stmt order), but that requires some > rework of the vectorizer main loops, which should be a separate patch. > Agreed? Yeah, ok. Still the renaming is not necessary here, the operand scanner does that automagically. The vectorizer forcefully rewrites virtual SSA form anyways :/ Richard. > > Ciao, > Michael.
On Tue, Apr 10, 2012 at 6:46 AM, Michael Matz <matz@suse.de> wrote: > Hi, > > Michael. > > PR tree-optimization/18437 > > * tree-vectorizer.h (_stmt_vec_info.stride_load_p): New member. > (STMT_VINFO_STRIDE_LOAD_P): New accessor. > (vect_check_strided_load): Declare. > * tree-vect-data-refs.c (vect_check_strided_load): New function. > (vect_analyze_data_refs): Use it to accept strided loads. > * tree-vect-stmts.c (vectorizable_load): Ditto and handle them. > This caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53062 H.J.
On Sat, Apr 21, 2012 at 11:11 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Tue, Apr 10, 2012 at 6:46 AM, Michael Matz <matz@suse.de> wrote: >> Hi, >> > >> Michael. >> >> PR tree-optimization/18437 >> >> * tree-vectorizer.h (_stmt_vec_info.stride_load_p): New member. >> (STMT_VINFO_STRIDE_LOAD_P): New accessor. >> (vect_check_strided_load): Declare. >> * tree-vect-data-refs.c (vect_check_strided_load): New function. >> (vect_analyze_data_refs): Use it to accept strided loads. >> * tree-vect-stmts.c (vectorizable_load): Ditto and handle them. >> > > This caused: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53062 > It also caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53048
On Sat, Apr 21, 2012 at 11:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Sat, Apr 21, 2012 at 11:11 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >> On Tue, Apr 10, 2012 at 6:46 AM, Michael Matz <matz@suse.de> wrote: >>> Hi, >>> >> >>> Michael. >>> >>> PR tree-optimization/18437 >>> >>> * tree-vectorizer.h (_stmt_vec_info.stride_load_p): New member. >>> (STMT_VINFO_STRIDE_LOAD_P): New accessor. >>> (vect_check_strided_load): Declare. >>> * tree-vect-data-refs.c (vect_check_strided_load): New function. >>> (vect_analyze_data_refs): Use it to accept strided loads. >>> * tree-vect-stmts.c (vectorizable_load): Ditto and handle them. >>> >> >> This caused: >> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53062 >> > > It also caused: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53048 > It also caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53390
Index: tree-vectorizer.h =================================================================== --- tree-vectorizer.h.orig 2012-04-10 13:19:18.000000000 +0200 +++ tree-vectorizer.h 2012-04-10 13:21:19.000000000 +0200 @@ -545,6 +545,7 @@ typedef struct _stmt_vec_info { /* For loads only, true if this is a gather load. */ bool gather_p; + bool stride_load_p; } *stmt_vec_info; /* Access Functions. */ @@ -559,6 +560,7 @@ typedef struct _stmt_vec_info { #define STMT_VINFO_VECTORIZABLE(S) (S)->vectorizable #define STMT_VINFO_DATA_REF(S) (S)->data_ref_info #define STMT_VINFO_GATHER_P(S) (S)->gather_p +#define STMT_VINFO_STRIDE_LOAD_P(S) (S)->stride_load_p #define STMT_VINFO_DR_BASE_ADDRESS(S) (S)->dr_base_address #define STMT_VINFO_DR_INIT(S) (S)->dr_init @@ -875,6 +877,7 @@ extern bool vect_analyze_data_ref_access extern bool vect_prune_runtime_alias_test_list (loop_vec_info); extern tree vect_check_gather (gimple, loop_vec_info, tree *, tree *, int *); +extern bool vect_check_strided_load (gimple, loop_vec_info, tree *, tree *); extern bool vect_analyze_data_refs (loop_vec_info, bb_vec_info, int *); extern tree vect_create_data_ref_ptr (gimple, tree, struct loop *, tree, tree *, gimple_stmt_iterator *, Index: tree-vect-data-refs.c =================================================================== --- tree-vect-data-refs.c.orig 2012-04-10 13:18:35.000000000 +0200 +++ tree-vect-data-refs.c 2012-04-10 13:21:19.000000000 +0200 @@ -2690,6 +2690,53 @@ vect_check_gather (gimple stmt, loop_vec return decl; } +/* Check whether a non-affine load in STMT (being in the loop referred to + in LOOP_VINFO) is suitable for handling as strided load. That is the case + if its address is a simple induction variable. If so return the base + of that induction variable in *BASEP and the (loop-invariant) step + in *STEPP, both only when that pointer is non-zero. + + This handles ARRAY_REFs (with variant index) and MEM_REFs (with variant + base pointer) only. 
*/ + +bool +vect_check_strided_load (gimple stmt, loop_vec_info loop_vinfo, tree *basep, + tree *stepp) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info); + tree base, off; + affine_iv iv; + + base = DR_REF (dr); + + if (TREE_CODE (base) == ARRAY_REF) + { + off = TREE_OPERAND (base, 1); + base = TREE_OPERAND (base, 0); + } + else if (TREE_CODE (base) == MEM_REF) + { + off = TREE_OPERAND (base, 0); + base = TREE_OPERAND (base, 1); + } + else + return false; + + if (TREE_CODE (off) != SSA_NAME) + return false; + + if (!expr_invariant_in_loop_p (loop, base) + || !simple_iv (loop, loop_containing_stmt (stmt), off, &iv, true)) + return false; + + if (basep) + *basep = iv.base; + if (stepp) + *stepp = iv.step; + return true; +} /* Function vect_analyze_data_refs. @@ -3090,16 +3137,21 @@ vect_analyze_data_refs (loop_vec_info lo VEC (ddr_p, heap) *ddrs = LOOP_VINFO_DDRS (loop_vinfo); struct data_dependence_relation *ddr, *newddr; bool bad = false; + bool strided_load = false; tree off; VEC (loop_p, heap) *nest = LOOP_VINFO_LOOP_NEST (loop_vinfo); - if (!vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL) - || get_vectype_for_scalar_type (TREE_TYPE (off)) == NULL_TREE) + strided_load = vect_check_strided_load (stmt, loop_vinfo, NULL, NULL); + gather = 0 != vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL); + if (gather + && get_vectype_for_scalar_type (TREE_TYPE (off)) == NULL_TREE) + gather = false; + if (!gather && !strided_load) { if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) { fprintf (vect_dump, - "not vectorized: not suitable for gather "); + "not vectorized: not suitable for gather/strided load "); print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM); } return false; @@ -3152,13 +3204,16 @@ vect_analyze_data_refs (loop_vec_info lo { fprintf (vect_dump, "not vectorized: data dependence conflict" - " prevents gather"); + " prevents 
gather/strided load"); print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM); } return false; } - STMT_VINFO_GATHER_P (stmt_info) = true; + if (gather) + STMT_VINFO_GATHER_P (stmt_info) = true; + else if (strided_load) + STMT_VINFO_STRIDE_LOAD_P (stmt_info) = true; } } Index: tree-vect-stmts.c =================================================================== --- tree-vect-stmts.c.orig 2012-04-10 13:18:35.000000000 +0200 +++ tree-vect-stmts.c 2012-04-10 13:21:19.000000000 +0200 @@ -4224,6 +4224,7 @@ vectorizable_load (gimple stmt, gimple_s tree aggr_type; tree gather_base = NULL_TREE, gather_off = NULL_TREE; tree gather_off_vectype = NULL_TREE, gather_decl = NULL_TREE; + tree stride_base, stride_step; int gather_scale = 1; enum vect_def_type gather_dt = vect_unknown_def_type; @@ -4357,6 +4358,10 @@ vectorizable_load (gimple stmt, gimple_s return false; } } + else if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) + { + vect_check_strided_load (stmt, loop_vinfo, &stride_base, &stride_step); + } if (!vec_stmt) /* transformation not required. */ { @@ -4520,6 +4525,104 @@ vectorizable_load (gimple stmt, gimple_s STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; else STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt; + prev_stmt_info = vinfo_for_stmt (new_stmt); + } + return true; + } + else if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) + { + gimple_stmt_iterator incr_gsi; + bool insert_after; + gimple incr; + tree offvar; + tree ref = DR_REF (dr); + tree ivstep; + tree running_off; + VEC(constructor_elt, gc) *v = NULL; + gimple_seq stmts = NULL; + + gcc_assert (stride_base && stride_step); + + /* For a load with loop-invariant (but other than power-of-2) + stride (i.e. not a grouped access) like so: + + for (i = 0; i < n; i += stride) + ... = array[i]; + + we generate a new induction variable and new accesses to + form a new vector (or vectors, depending on ncopies): + + for (j = 0; ; j += VF*stride) + tmp1 = array[j]; + tmp2 = array[j + stride]; + ... 
+ vectemp = {tmp1, tmp2, ...} + */ + + ivstep = stride_step; + ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (ivstep), ivstep, + build_int_cst_type (TREE_TYPE (ivstep), vf)); + + standard_iv_increment_position (loop, &incr_gsi, &insert_after); + + create_iv (stride_base, ivstep, NULL, + loop, &incr_gsi, insert_after, + &offvar, NULL); + incr = gsi_stmt (incr_gsi); + set_vinfo_for_stmt (incr, new_stmt_vec_info (incr, loop_vinfo, NULL)); + + stride_step = force_gimple_operand (stride_step, &stmts, true, NULL_TREE); + if (stmts) + gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts); + + prev_stmt_info = NULL; + running_off = offvar; + for (j = 0; j < ncopies; j++) + { + tree vec_inv; + + v = VEC_alloc (constructor_elt, gc, nunits); + for (i = 0; i < nunits; i++) + { + tree newref, newoff; + gimple incr; + if (TREE_CODE (ref) == ARRAY_REF) + newref = build4 (ARRAY_REF, TREE_TYPE (ref), + unshare_expr (TREE_OPERAND (ref, 0)), + running_off, + NULL_TREE, NULL_TREE); + else + newref = build2 (MEM_REF, TREE_TYPE (ref), + running_off, + TREE_OPERAND (ref, 1)); + + newref = force_gimple_operand_gsi (gsi, newref, true, + NULL_TREE, true, + GSI_SAME_STMT); + CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, newref); + newoff = SSA_NAME_VAR (running_off); + if (POINTER_TYPE_P (TREE_TYPE (newoff))) + incr = gimple_build_assign_with_ops (POINTER_PLUS_EXPR, newoff, + running_off, stride_step); + else + incr = gimple_build_assign_with_ops (PLUS_EXPR, newoff, + running_off, stride_step); + newoff = make_ssa_name (newoff, incr); + gimple_assign_set_lhs (incr, newoff); + vect_finish_stmt_generation (stmt, incr, gsi); + + running_off = newoff; + } + + vec_inv = build_constructor (vectype, v); + new_temp = vect_init_vector (stmt, vec_inv, vectype, gsi); + new_stmt = SSA_NAME_DEF_STMT (new_temp); + mark_symbols_for_renaming (new_stmt); + + if (j == 0) + STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; + else + STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt; prev_stmt_info = 
vinfo_for_stmt (new_stmt); } return true; Index: testsuite/gfortran.dg/vect/rnflow-trs2a2.f90 =================================================================== --- testsuite/gfortran.dg/vect/rnflow-trs2a2.f90 (revision 0) +++ testsuite/gfortran.dg/vect/rnflow-trs2a2.f90 (revision 0) @@ -0,0 +1,33 @@ +! { dg-do compile } +! { dg-require-effective-target vect_double } + + function trs2a2 (j, k, u, d, m) +! matrice de transition intermediaire, partant de k sans descendre +! sous j. R = IjU(I-Ik)DIj, avec Ii = deltajj, j >= i. +! alternative: trs2a2 = 0 +! trs2a2 (j:k-1, j:k-1) = matmul (utrsft (j:k-1,j:k-1), +! dtrsft (j:k-1,j:k-1)) +! + real, dimension (1:m,1:m) :: trs2a2 ! resultat + real, dimension (1:m,1:m) :: u, d ! matrices utrsft, dtrsft + integer, intent (in) :: j, k, m ! niveaux vallee pic +! +!##### following line replaced by Prentice to make less system dependent +! real (kind = kind (1.0d0)) :: dtmp + real (kind = selected_real_kind (10,50)) :: dtmp +! + trs2a2 = 0.0 + do iclw1 = j, k - 1 + do iclw2 = j, k - 1 + dtmp = 0.0d0 + do iclww = j, k - 1 + dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2) + enddo + trs2a2 (iclw1, iclw2) = dtmp + enddo + enddo + return + end function trs2a2 + +! { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } +! { dg-final { cleanup-tree-dump "vect" } }