[v2,0/8] OpenMP: Unified Shared Memory via Managed Memory

Message ID 20240628102449.562467-1-ams@baylibre.com

Andrew Stubbs June 28, 2024, 10:24 a.m. UTC
These patches are an evolution of the USM portion of the patches previously
posted in July 2022 (yes, it's taken a while!):

https://patchwork.sourceware.org/project/gcc/list/?series=10748&state=%2A&archive=both

The pinned memory portion has already been posted (and partially approved);
the v5 version of that series must be applied before this one.

https://patchwork.sourceware.org/project/gcc/list/?series=35022&state=%2A&archive=both

The series implements OpenMP's "Unified Shared Memory" concept, first
for NVidia GPUs, and then for AMD GPUs.  We already have a very simple
implementation of USM that works on integrated APU devices and any other
device that supports shared memory access natively.  This new
implementation replaces that implementation in the case where using
"managed memory" is likely to be a win (the usual non-APU case).

In theory, explicit mapping of exactly the right memory with carefully
hand-optimized "to" and "from" directives is the optimal implementation
(except possibly in the case where the data is too large for the device).
Experimentally, the "dumb" USM implementation we already have performs
quite well with modern devices and drivers.  This new managed memory
implementation appears to fall between the two, and can outperform
explicit mapping in the non-trivial cases (e.g. many small mappings, sparse
data, rectangular copies, etc.)

The trade-off for the additional performance is added complexity, and
malloc/free are no longer compatible with external libraries (e.g. strdup).
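
As a rough illustration of the strdup case: going by the shortlog entry "Use
libgomp memory allocation functions with unified shared memory", the sketch
below assumes the problem comes from pairing memory allocated inside a
library with a free call in USM-compiled user code.  The helper names
dup_string and free_string are hypothetical and not part of this series;
they only show how to keep both halves of the pair in the same code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for strdup.  The malloc call lives in user
   code, so (under the assumption stated above) it is treated the same
   way as the matching free in free_string.  */
char *
dup_string (const char *s)
{
  assert (s != NULL);
  size_t len = strlen (s) + 1;   /* Include the terminating NUL.  */
  char *copy = malloc (len);
  if (copy != NULL)
    memcpy (copy, s, len);
  return copy;
}

/* Release a string obtained from dup_string.  Strings returned by an
   external library should be released by that library's own routine
   instead.  */
void
free_string (char *s)
{
  free (s);
}
```

Nothing here is USM-specific; the point is only that allocation and
deallocation stay on the same side of the library boundary.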

To help mitigate these incompatibility issues, two new GNU extensions
are added:

1. ompx_gnu_unified_shared_mem_alloc / ompx_gnu_unified_shared_mem_space

  This new pre-defined allocator, used with omp_alloc, allows a
  programmer to explicitly allocate managed memory without converting
  the whole program to USM.  Creating explicit mappings for this memory is
  now optional, and if they do occur the runtime will detect the USM and apply
  no-op mappings.

2. ompx_gnu_host_mem_alloc / ompx_gnu_host_mem_space

  Conversely, this new pre-defined allocator allows a programmer to
  override "requires unified_shared_memory" and obtain regular host
  memory from the regular system heap.  This might be desirable when a
  large amount of memory is needed in a completely unrelated context, or
  for interacting with external libraries.
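
A minimal sketch of how the two allocators might be used together.  The
allocator names come from this series; HAVE_OMPX_GNU_USM is a hypothetical
macro (not defined by GCC) that you would set yourself when building against
a libgomp with these patches applied.  Without it the sketch falls back to
malloc/free so that it still builds with an unpatched toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* ASSUMPTION: HAVE_OMPX_GNU_USM is a user-supplied macro, e.g. passed as
   -DHAVE_OMPX_GNU_USM, meaning "libgomp provides the new allocators".  */
#if defined(_OPENMP) && defined(HAVE_OMPX_GNU_USM)
# define USM_ALLOC(size)  omp_alloc ((size), ompx_gnu_unified_shared_mem_alloc)
# define USM_FREE(ptr)    omp_free ((ptr), ompx_gnu_unified_shared_mem_alloc)
# define HOST_ALLOC(size) omp_alloc ((size), ompx_gnu_host_mem_alloc)
# define HOST_FREE(ptr)   omp_free ((ptr), ompx_gnu_host_mem_alloc)
#else
# define USM_ALLOC(size)  malloc (size)
# define USM_FREE(ptr)    free (ptr)
# define HOST_ALLOC(size) malloc (size)
# define HOST_FREE(ptr)   free (ptr)
#endif

/* Fill N ints with 0..N-1, double them in a target region, and return
   their sum, or -1 if an allocation fails.  */
long
double_and_sum (int n)
{
  assert (n > 0);
  size_t bytes = (size_t) n * sizeof (int);
  int *data = USM_ALLOC (bytes);       /* Intended to be device-visible.  */
  int *host_copy = HOST_ALLOC (bytes); /* Ordinary host heap memory.  */
  long sum = -1;

  if (data != NULL && host_copy != NULL)
    {
      for (int i = 0; i < n; i++)
        data[i] = i;

      /* The map clause is described above as optional for managed memory
         (it becomes a no-op); it is kept here so that the malloc fallback
         also works.  */
      #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
      for (int i = 0; i < n; i++)
        data[i] *= 2;

      /* Host-only post-processing on the regular heap copy.  */
      memcpy (host_copy, data, bytes);
      sum = 0;
      for (int i = 0; i < n; i++)
        sum += host_copy[i];
    }

  HOST_FREE (host_copy);
  USM_FREE (data);
  return sum;
}
```

For example, double_and_sum (4) doubles 0, 1, 2, 3 and returns 12, whether the
loop runs on a device or falls back to the host.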

Known limitation: We can intercept dynamic heap allocations, but static
data and automatic stack variables are generally not accessible from the
device.  (Migrating stack pages used by an active thread seems like a
bad idea, in any case.)

I can approve the amdgcn patches myself, but comments are welcome.

OK for mainline?  (Once the pinned memory dependencies are committed.)

Thanks

Andrew

P.S. This series includes contributions from (at least) Thomas Schwinge,
Marcel Vollweiler, Kwok Cheung Yeung, and Abid Qadeer.

Andrew Stubbs (6):
  libgomp: Disentangle shared memory from managed
  openmp, nvptx: ompx_gnu_unified_shared_mem_alloc
  openmp: Enable -foffload-memory=unified
  amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK
  amdgcn: libgomp plugin USM implementation
  libgomp: Map omp_default_mem_space to USM

Hafiz Abid Qadeer (1):
  openmp: Use libgomp memory allocation functions with unified shared
    memory.

Marcel Vollweiler (1):
  openmp, libgomp: Handle unified shared memory in
    omp_target_is_accessible

 gcc/c/c-parser.cc                             |  20 +-
 gcc/config/gcn/gcn.cc                         |  32 +-
 gcc/config/gcn/mkoffload.cc                   |  35 +-
 gcc/cp/parser.cc                              |  20 +-
 gcc/fortran/openmp.cc                         |  14 +-
 gcc/fortran/parse.cc                          |   3 +-
 gcc/omp-low.cc                                | 188 +++++++
 gcc/passes.def                                |   1 +
 gcc/testsuite/c-c++-common/gomp/usm-1.c       |   4 +
 gcc/testsuite/c-c++-common/gomp/usm-2.c       |  46 ++
 gcc/testsuite/c-c++-common/gomp/usm-3.c       |  44 ++
 gcc/testsuite/g++.dg/gomp/usm-1.C             |  32 ++
 gcc/testsuite/g++.dg/gomp/usm-2.C             |  30 ++
 gcc/testsuite/g++.dg/gomp/usm-3.C             |  38 ++
 gcc/testsuite/g++.dg/gomp/usm-4.C             |  32 ++
 gcc/testsuite/g++.dg/gomp/usm-5.C             |  30 ++
 gcc/testsuite/gfortran.dg/gomp/usm-1.f90      |   6 +
 gcc/testsuite/gfortran.dg/gomp/usm-2.f90      |  16 +
 gcc/testsuite/gfortran.dg/gomp/usm-3.f90      |  13 +
 gcc/tree-pass.h                               |   1 +
 include/cuda/cuda.h                           |  13 +
 include/hsa.h                                 |  28 +-
 include/hsa_ext_amd.h                         | 459 +++++++++++++++++-
 include/hsa_ext_image.h                       |   2 +-
 libgomp/Makefile.in                           |  13 +-
 libgomp/allocator.c                           |  17 +-
 libgomp/config/gcn/allocator.c                |  10 +
 libgomp/config/linux/allocator.c              |  29 +-
 libgomp/config/nvptx/allocator.c              |  10 +
 libgomp/libgomp-plugin.h                      |   4 +
 libgomp/libgomp.h                             |   8 +
 libgomp/omp.h.in                              |   4 +
 libgomp/omp_lib.f90.in                        |   8 +
 libgomp/omp_lib.h.in                          |  10 +
 libgomp/plugin/Makefrag.am                    |   2 +-
 libgomp/plugin/cuda-lib.def                   |   2 +
 libgomp/plugin/plugin-gcn.c                   | 209 +++++++-
 libgomp/plugin/plugin-nvptx.c                 |  68 ++-
 libgomp/target.c                              |  96 +++-
 libgomp/testsuite/lib/libgomp.exp             |  22 +
 libgomp/testsuite/libgomp.c++/usm-1.C         |  54 +++
 libgomp/testsuite/libgomp.c++/usm-2.C         |  33 ++
 .../libgomp.c-c++-common/requires-1.c         |   1 +
 .../libgomp.c-c++-common/requires-4.c         |   3 +
 .../libgomp.c-c++-common/requires-4a.c        |   2 +
 .../libgomp.c-c++-common/requires-5.c         |   5 +-
 .../target-implicit-map-4.c                   |  18 +
 .../target-is-accessible-1.c                  |  22 +-
 .../target-is-accessible-2.c                  |  21 +
 .../alloc-ompx_gnu_host_mem_alloc-1.c         |  77 +++
 libgomp/testsuite/libgomp.c/usm-1.c           |  26 +
 libgomp/testsuite/libgomp.c/usm-2.c           |  34 ++
 libgomp/testsuite/libgomp.c/usm-3.c           |  37 ++
 libgomp/testsuite/libgomp.c/usm-4.c           |  38 ++
 libgomp/testsuite/libgomp.c/usm-5.c           |  30 ++
 libgomp/testsuite/libgomp.c/usm-6.c           |  94 ++++
 .../target-is-accessible-1.f90                |  20 +-
 .../target-is-accessible-2.f90                |  22 +
 libgomp/testsuite/libgomp.fortran/usm-1.f90   |  28 ++
 libgomp/testsuite/libgomp.fortran/usm-2.f90   |  33 ++
 libgomp/testsuite/libgomp.fortran/usm-3.f90   |  33 ++
 libgomp/usm-allocator.c                       | 232 +++++++++
 libgomp/usmpin-allocator.c                    |   3 +
 63 files changed, 2403 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-4.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-5.C
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90
 mode change 100644 => 100755 include/hsa.h
 mode change 100644 => 100755 include/hsa_ext_amd.h
 mode change 100644 => 100755 include/hsa_ext_image.h
 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C
 create mode 100644 libgomp/testsuite/libgomp.c++/usm-2.C
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-ompx_gnu_host_mem_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/target-is-accessible-2.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/usm-1.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/usm-2.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/usm-3.f90
 create mode 100644 libgomp/usm-allocator.c