[0/9,COVER-LETTER,nvptx] Add support for warp-multiple openacc vector length

Message ID 20190112222131.29519-1-tdevries@suse.de

Tom de Vries Jan. 12, 2019, 10:21 p.m. UTC
I. Current state

The current openacc implementation sets vector length to warp-size.

There are two aspects that need to be implemented for an openacc implementation
to work: communication and synchronization.  Synchronization is needed at the
end of worker and vector loops.  Communication is needed at the start of worker
and vector loops, to propagate state that has not been calculated redundantly in
vector-single and worker-single mode to vector-partitioned and
worker-partitioned mode.

For worker loops, synchronization at the end of the loop is done using the
inter-warp synchronization instruction 'bar.sync 0'.  Communication is done
using a buffer in shared memory (and synchronization is used to ensure that the
buffer is used properly).

For vector loops with warp-sized vector length, synchronization at the end of
the loop is not needed, since warps are synchronized by definition.
Communication is done using the intra-warp communication instruction shfl.

These vector and worker schemes do not change if we nest a vector loop in a
worker loop.  OTOH, a vector-and-worker loop uses the worker scheme.
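
To make the two loop shapes concrete, here is a minimal sketch in OpenACC C.  It
is not taken from the test-cases in this series; the function names and the
array layout are made up for illustration.  Compiled without -fopenacc, the
pragmas are ignored and the loops run sequentially on the host.

```c
#include <assert.h>

/* Hypothetical example: a vector loop nested in a worker loop.  The
   outer loop is partitioned over workers, the inner one over vector
   lanes.  */
void
nested_increment (int *a, int n, int m)
{
  assert (a && n >= 0 && m >= 0);

#pragma acc parallel copy (a[0:n * m])
  {
#pragma acc loop worker
    for (int i = 0; i < n; i++)
      {
#pragma acc loop vector
	for (int j = 0; j < m; j++)
	  a[i * m + j] += 1;
      }
  }
}

/* Hypothetical example: a single vector-and-worker loop.  */
void
combined_increment (int *a, int n)
{
  assert (a && n >= 0);

#pragma acc parallel copy (a[0:n])
  {
#pragma acc loop worker vector
    for (int i = 0; i < n; i++)
      a[i] += 1;
  }
}
```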

II. Patch series

This patch series adds the possibility to use warp-multiple openacc vector
length.

This means we can no longer rely on the same mechanisms for communication and
synchronization of vector loops, and need to apply the same ones as we do for
worker loops.

II.a Vector loop

A vector loop with warp-sized vector length looks as before.  A vector loop with
warp-multiple vector length looks like a simple worker loop.
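
As an illustration, a loop of the following shape is the kind of code that
would take the warp-multiple path.  This is a sketch only, not one of the
test-cases added by the series; the name vector_add is an assumption made for
the example.  Without -fopenacc the pragmas are ignored and the loop runs on
the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: element-wise add in a vector loop whose vector
   length (128) is a multiple of the warp size.  */
void
vector_add (const int *a, const int *b, int *c, size_t n)
{
  assert (a && b && c);

#pragma acc parallel vector_length (128) \
  copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
#pragma acc loop vector
    for (size_t i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```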

II.b Vector-and-worker loop

A vector-and-worker loop is handled as a worker loop, as before.

II.c Vector loop in worker loop

A vector loop in a worker loop with warp-sized vector length looks as before.

A vector loop in a worker loop with warp-multiple vector length is handled as
follows.

We use the 'bar.sync 0' instruction (which synchronizes all threads in a CTA)
for worker synchronization, but to synchronize only the warps that together
form a vector, we use 'bar.sync <id>, <vector-length>', where <id> uniquely
identifies the vector (we use the worker id, offset by one so as not to clash
with logical barrier resource '0' used by worker synchronization, so:
%tid.y + 1).
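
The barrier numbering can be modelled on the host as below.  This is a sketch,
not code from the patch: the names are hypothetical, and the limit of 16
logical barrier resources per CTA is my reading of the PTX documentation
rather than something stated in this series.

```c
#include <assert.h>

/* Logical barrier resource used by worker synchronization
   ('bar.sync 0').  */
#define WORKER_BARRIER_ID 0

/* Assumption: PTX provides logical barrier resources 0..15 per CTA.  */
#define NUM_BARRIERS 16

/* Barrier id for the vector owned by worker WORKER_ID (%tid.y): offset
   by one to stay clear of WORKER_BARRIER_ID.  */
unsigned
vector_barrier_id (unsigned worker_id)
{
  unsigned id = worker_id + 1;
  assert (id < NUM_BARRIERS);
  return id;
}
```

Under that assumption, the offset by one leaves 15 ids available for vector
barriers.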

Furthermore, the fact that vectors synchronize independently means that vector
state needs to be propagated independently.  We handle this by allocating a
state propagation buffer for each vector.  So, the shared memory buffer is
partitioned into a part for worker propagation, and num_worker parts for vector
propagation.

We'll name the first part worker-generic and the other parts worker-specific
(but we've got one vector per worker, so confusingly you might also call it
vector-specific).
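
One way to picture the partitioning is as offsets into a single buffer.  The
sketch below assumes that the worker-generic part comes first and that all
worker-specific parts have the same size; the struct, the function names and
the sizes are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout descriptor; sizes are in bytes.  */
struct bcast_layout
{
  size_t worker_generic_size;	/* Part used for worker propagation.  */
  size_t worker_specific_size;	/* Size of each per-vector part.  */
  unsigned num_workers;
};

/* Offset of the worker-specific part belonging to worker WORKER_ID.
   The worker-generic part is assumed to start at offset 0.  */
size_t
worker_specific_offset (const struct bcast_layout *l, unsigned worker_id)
{
  assert (worker_id < l->num_workers);
  return l->worker_generic_size + worker_id * l->worker_specific_size;
}

/* Total shared memory needed for the whole buffer.  */
size_t
bcast_total_size (const struct bcast_layout *l)
{
  return l->worker_generic_size + l->num_workers * l->worker_specific_size;
}
```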

In a vector loop in a worker loop, we first transition from worker-single to
worker-partitioned, and then from vector-single to vector-partitioned, which
means state propagation from W0V0 to WAV0, and then state propagation from WAV0
to WAVA (using W for worker, V for vector, and A for all).
For branch condition propagation, however, a condition calculated in
worker-single-vector-single mode is propagated from W0V0 to WAVA directly (so we
use the worker-generic buffer for that).
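
The two propagation paths can be modelled sequentially on the host as follows.
This is only a model of the data flow, with made-up names and dimensions; it
leaves out the barriers that the real code needs around each buffer access.

```c
#include <assert.h>

#define NUM_WORKERS 4
#define VECTOR_LENGTH 64

/* Host-side stand-ins for the shared memory buffer parts.  */
static int worker_generic;
static int worker_specific[NUM_WORKERS];

/* State propagation: W0V0 -> WAV0 through the worker-generic part,
   then WAV0 -> WAVA through each worker's own part.  */
void
propagate_state (int state[NUM_WORKERS][VECTOR_LENGTH])
{
  worker_generic = state[0][0];
  for (int w = 0; w < NUM_WORKERS; w++)
    state[w][0] = worker_generic;

  for (int w = 0; w < NUM_WORKERS; w++)
    {
      worker_specific[w] = state[w][0];
      for (int v = 0; v < VECTOR_LENGTH; v++)
	state[w][v] = worker_specific[w];
    }
}

/* Branch condition propagation: W0V0 -> WAVA directly, using only the
   worker-generic part.  */
void
propagate_branch_condition (int cond[NUM_WORKERS][VECTOR_LENGTH])
{
  worker_generic = cond[0][0];
  for (int w = 0; w < NUM_WORKERS; w++)
    for (int v = 0; v < VECTOR_LENGTH; v++)
      cond[w][v] = worker_generic;
}
```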

II.d Routines

There's a question on how to handle vector-partitionable routines in such a
scheme, given these can now be called from a context with a warp-multiple vector
length, while the current implementation of routines assumes warp-sized vector
length.  This patch series takes a conservative approach: keep routine
generation as is, detect whether an offloading region calls a
vector-partitionable routine, and if so fall back to warp-sized vector length
in that region.
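
The fallback rule amounts to the following decision, shown here as a host-side
sketch.  The function name and the boolean input are hypothetical; how the
compiler actually detects the call is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32

/* Vector length to use for an offloading region.  REQUESTED is the
   warp-multiple length asked for; CALLS_VECTOR_ROUTINE says whether the
   region calls a vector-partitionable routine.  */
int
effective_vector_length (int requested, bool calls_vector_routine)
{
  assert (requested > 0 && requested % WARP_SIZE == 0);
  return calls_vector_routine ? WARP_SIZE : requested;
}
```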

III. Testing

Built and reg-tested on x86_64 with nvptx accelerator.

Built and reg-tested on x86_64 with nvptx accelerator with
PTX_DEFAULT_VECTOR_LENGTH set to various sizes.

IV. Patches

     1  [nvptx] Enable large vectors
     2  [nvptx] Update insufficient launch message for variable vector_length
     3  [nvptx] Enable large vectors -- test-cases
     4  [nvptx] Enable large vectors -- reduction testcases
     5  [nvptx] Don't emit barriers for empty loops -- test-cases
     6  [nvptx] Force vl32 if calling vector-partitionable routines -- test-cases
     7  [nvptx] Add vector_length 64 test-cases
     8  [nvptx] Enable setting vector length using -fopenacc-dim
     9  [nvptx] Enable setting vector length using -fopenacc-dim -- testcases


Tom de Vries (9):
  [nvptx] Enable large vectors
  [nvptx] Update insufficient launch message for variable vector_length
  [nvptx] Enable large vectors -- test-cases
  [nvptx] Enable large vectors -- reduction testcases
  [nvptx] Don't emit barriers for empty loops -- test-cases
  [nvptx] Force vl32 if calling vector-partitionable routines --
    test-cases
  [nvptx] Add vector_length 64 test-cases
  [nvptx] Enable setting vector length using -fopenacc-dim
  [nvptx] Enable setting vector length using -fopenacc-dim -- testcases

 gcc/config/nvptx/nvptx.c                           |  5 +-
 libgomp/plugin/plugin-nvptx.c                      | 20 +++---
 .../libgomp.oacc-c-c++-common/parallel-dims.c      |  4 +-
 .../libgomp.oacc-c-c++-common/pr85381-5.c          | 24 +++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85381.c  | 18 +++++
 .../libgomp.oacc-c-c++-common/pr85486-2.c          | 52 ++++++++++++++
 .../libgomp.oacc-c-c++-common/pr85486-3.c          | 54 +++++++++++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85486.c  | 51 ++++++++++++++
 .../vector-length-128-1.c                          |  5 +-
 .../vector-length-128-10.c                         | 39 +++++++++++
 .../vector-length-128-2.c                          | 39 +++++++++++
 .../vector-length-128-4.c                          | 40 +++++++++++
 .../vector-length-128-5.c                          | 41 +++++++++++
 .../vector-length-128-6.c                          | 41 +++++++++++
 .../vector-length-128-7.c                          | 40 +++++++++++
 .../libgomp.oacc-c-c++-common/vector-length-64-1.c | 17 +++++
 .../libgomp.oacc-c-c++-common/vector-length-64-2.c | 21 ++++++
 .../libgomp.oacc-c-c++-common/vector-length-64-3.c | 17 +++++
 .../libgomp.oacc-c-c++-common/vred2d-128.c         | 55 +++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm-2.f90  | 80 ++++++++++++++++++++++
 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90    | 79 +++++++++++++++++++++
 21 files changed, 726 insertions(+), 16 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85381.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-64-3.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm-2.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gemm.f90