diff mbox series

libgomp: with USM, init 'link' variables with host address

Message ID 4d70699c-e094-414a-a202-3f1ec1d08a49@baylibre.com
State New
Headers show
Series libgomp: with USM, init 'link' variables with host address | expand

Commit Message

Tobias Burnus Sept. 14, 2024, 10:32 p.m. UTC
The idea of link variables is to replace he full device variable by a 
pointer, permitting to map only parts of the variable to the device, 
saving memory.

However, having a pointer permits for (unified) shared memory to point 
to the host variable.

That's what this patch does: instead of having a dangling pointer, upon 
loading the image, the device side pointers are updated to point to the 
host. With the current patch, this is only done when explicitly 
requesting unified-shared memory.

Tested on x86-64-gnu-linux and nvptx offloading (that supports USM).

Remarks/comments/suggestions before I commit it?

Tobias

PS: I intent to do some additional changes for improved USM handling. 
Once done, I intent to look into (a) given the user a bit more power on 
mapping vs. not mapping and (b) to use for APUs by default USM, even 
without 'requires unified_shared_memory'.

Comments

Thomas Schwinge Sept. 17, 2024, 10:34 a.m. UTC | #1
Hi Tobias!

On 2024-09-15T00:32:21+0200, Tobias Burnus <tburnus@baylibre.com> wrote:
> The idea of link variables is to replace he full device variable by a 
> pointer, permitting to map only parts of the variable to the device, 
> saving memory.
>
> However, having a pointer permits for (unified) shared memory to point 
> to the host variable.
>
> That's what this patch does: instead of having a dangling pointer, upon 
> loading the image, the device side pointers are updated to point to the 
> host. With the current patch, this is only done when explicitly 
> requesting unified-shared memory.
>
> Tested on x86-64-gnu-linux and nvptx offloading (that supports USM).

(I yet have to set up such a USM configuration...)

> Remarks/comments/suggestions before I commit it?

> libgomp: with USM, init 'link' variables with host address
>
> If requires unified_shared_memory is set, make 'declare target link'
> variables to point initially to the host pointer.
>
> libgomp/ChangeLog:
>
> 	* target.c (gomp_load_image_to_device): For requires
> 	unified_shared_memory, update 'link' vars to point to the host var.
> 	* testsuite/libgomp.c-c++-common/target-link-3.c: New test.
>
>  libgomp/target.c                                   |  5 +++
>  .../testsuite/libgomp.c-c++-common/target-link-3.c | 52 ++++++++++++++++++++++
>  2 files changed, 57 insertions(+)

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -2451,6 +2451,11 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>        array->right = NULL;
>        splay_tree_insert (&devicep->mem_map, array);
>        array++;

Do I understand correctly that even if
'GOMP_REQUIRES_UNIFIED_SHARED_MEMORY', we cannot just skip all the
'mem_map' setup in 'gomp_load_image_to_device' etc., because we're not
(yet?) setting 'GOMP_OFFLOAD_CAP_SHARED_MEM'?  (I've not yet worked
through the "libgomp: Enable USM for some nvptx devices" discussion from
earlier this year.)

> +
> +      if (is_link_var
> +	  && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
> +	gomp_copy_host2dev (devicep, NULL, (void *) target_var->start,
> +			    &k->host_start, sizeof (void *), false, NULL);
>      }

Calling 'gomp_copy_host2dev' looks a bit funny given we've just
determined USM (..., but I'm not asking for plain 'memcpy').

There is nothing to un-do in 'gomp_unload_image_from_device', right?

What's the advantage/rationale of doing this here vs. in
'gomp_map_vars_internal' for 'REFCOUNT_LINK'?  (May be worth a source
code comment?)

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.c-c++-common/target-link-3.c
> @@ -0,0 +1,52 @@
> +/* { dg-do run }  */
> +
> +#include <stdint.h>
> +#include <omp.h>
> +
> +#pragma omp requires unified_shared_memory
> +

Intentionally mixing non-'static' vs. 'static' in the following?

> +int A[3] = {-3,-4,-5};
> +static int q = -401;
> +#pragma omp declare target link(A, q)
> +
> +#pragma omp begin declare target
> +void
> +f (uintptr_t *pA, uintptr_t *pq)
> +{
> +  if (A[0] != 1 || A[1] != 2 || A[2] != 3 || q != 42)
> +    __builtin_abort ();
> +  A[0] = 13;
> +  A[1] = 14;
> +  A[2] = 15;
> +  q = 23;
> +  *pA = (uintptr_t) &A[0];
> +  *pq = (uintptr_t) &q;
> +}
> +#pragma omp end declare target
> +
> +int
> +main ()
> +{
> +  uintptr_t hpA = (uintptr_t) &A[0];
> +  uintptr_t hpq = (uintptr_t) &q;
> +  uintptr_t dpA, dpq;
> +
> +  A[0] = 1;
> +  A[1] = 2;
> +  A[2] = 3;
> +  q = 42;
> +
> +  for (int i = 0; i <= omp_get_num_devices (); ++i)
> +    {
> +      #pragma omp target device(device_num: i) map(dpA, dpq)
> +	f (&dpA, &dpq);
> +      if (hpA != dpA || hpq != dpq)
> +	__builtin_abort ();
> +      if (A[0] != 13 || A[1] != 14 || A[2] != 15 || q != 23)
> +	__builtin_abort ();
> +      A[0] = 1;
> +      A[1] = 2;
> +      A[2] = 3;
> +      q = 42;
> +    }
> +}
Tobias Burnus Sept. 17, 2024, 2:32 p.m. UTC | #2
Hi Thomas,

short version: I think the patch as posted is fine and no action beyond 
is needed for this one issue.

See below for the long version.

Possibly modifications (now or as follow up):
- using memcpy + or let the plugin do it
- not adding link variables to the splay tree with 'USM'.

Thomas Schwinge wrote:
>> Tested on x86-64-gnu-linux and nvptx offloading (that supports USM).
> (I yet have to set up such a USM configuration...)

You already used an USM config, e.g., when running gfx90a (likewise: 
gfx90c), except that USM on mainline it currently only works if you 
explicitly set 'export HSA_XNACK=1'.

For Nvptx, you need a post-Volta GPU with the open-kernels driver, which 
is for newer driver versions the default.

* * *
> Do I understand correctly that even if
> 'GOMP_REQUIRES_UNIFIED_SHARED_MEMORY', we cannot just skip all the
> 'mem_map' setup in 'gomp_load_image_to_device' etc., because we're not
> (yet?) setting 'GOMP_OFFLOAD_CAP_SHARED_MEM'?

We actually do set GOMP_OFFLOAD_CAP_SHARED_MEM with 'requires 
unified_shared_memory'.

But, indeed, we cannot skip the memory mapping parts – due to the way we 
handle static variables.

* * *

>> +
>> +      if (is_link_var
>> +	  && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
>> +	gomp_copy_host2dev (devicep, NULL, (void *) target_var->start,
>> +			    &k->host_start, sizeof (void *), false, NULL);
>>       }
> Calling 'gomp_copy_host2dev' looks a bit funny given we've just
> determined USM (..., but I'm not asking for plain 'memcpy').

I guess a plain memcpy would do as well. [Assuming that the device's 
static variable is host accessible, which it probably is and should be.]

I add it to my to-do list for USM-related tasks to change this; possibly 
moving it to the plugin side has some advantages? Possibly not adding it 
to the splay tree if not needed. (Cf. below for env var discussion.)

Regarding the unload: For 'declare target link(A)', we have, e.g., 
'static int *A' on the device side. Thus, we could do 'A = NULL' – and 
rather should do 'A = {clobber}', but that's rather pointless in general 
and especially when unloading the image.

> What's the advantage/rationale of doing this here vs. in
> 'gomp_map_vars_internal' for 'REFCOUNT_LINK'?  (May be worth a source
> code comment?)

(A, B, C refers to the following example.)

We don't see 'A' (or 'B') in the GOMP_target_ext call and thus not in 
gomp_map_vars_internal.

Besides: We only want to do the initialization once and not every time 
gomp_map_vars_internal is called.

I think the following program may help to understand the issue and the 
patch better.

Note: While A, B, C are 'int …[3]' on the host, on the device we only 
have 'int B[3]' while for A it's 'int *A' and C only exists on the host.

  * * *

#pragma requires unified_shared_memory

static int A[3], B[3], C[3];
#pragma omp declare target link(A) enter(B)

#pragma omp begin declare target
void f(int *p)
{
    A[2] += B[2] + p[2];  // p points to the host's C variable
}
#pragma omp end declare target

void foo(int dev) {
   int *ptr = C;
   #pragma omp target firstprivate(ptr) device(dev)
     f (ptr);
}


* * *

Here, 'ptr' (and thus 'p') point to the host 'C' variable, both before the target
region and inside the target region.

'B' points to the device local version of the variable.

And 'A' on a non-host device is likely to be NULL ('static int *A' + .BSS) before this patch.
Or pointing to the host's 'A' with this patch.

* * *

With A pointing to the host version (and likewise 'p' pointing to the host C), host fallback
and device version yield identical result for 'A' and for 'C' (via ptr/p). — However, 'B' on
host and non-host device have nothing in common. While that might be fine, in general it is not.

Hence, in order to get for a .BSS valued 'B' the same result on host and device, we need, e.g.

#pragma omp data map(always: B) device(dev)
   foo (dev);

to call 'foo' to ensure that the two 'B' are in sync.

* * *

Code wise, this means that with GOMP_OFFLOAD_CAP_SHARED_MEM, we still have
to apply the map for 'declare target enter(…)' variables, except if host
and device share the same code – but that should only be the case for
host fallback (= initial device) and, possibly, GOMP_OFFLOAD_CAP_NATIVE_EXEC.

* * *

NOTE: OpenMP still permits to honor explicit 'map' with 'requires unified_shared_memory',
only with 'self' maps, copying the data in 'map' is explicitly disallowed.

* * *

This patch + honoring 'map' for static (non-'link'?) variables even with
GOMP_OFFLOAD_CAP_SHARED_MEM where the main items for the USM follow-up patches,
I meant by "More USM cleanup/fixes/extensions to make it _more_ useful" on slide 16
of https://gcc.gnu.org/wiki/cauldron2024#cauldron2024talks.openmp_openacc_and_offloading_in_gcc

Plus, to go a bit beyond:
- offering a flag to change 'declare target enter(…)' to 'link(…)'
   [RFC: enable it by default for 'requires unified_shared_memory'?]

- switching to GOMP_OFFLOAD_CAP_SHARED_MEM by default for APUs
   (= same memory controller) for performance

- Adding a GOMP_ environment variable to toggle between mapping vs. USM
   access on systems not detected as being APUs. (That is: systems that
   support USM but use an interconnect or page migration for the memory
   access. Possibly, also overriding the USM detection for systems which
   can access the host memory but due to some own memory are not recognized
   (→ device attributes) as being USM devices.
   And possibly also forcing to honor explicit maps with requires

(Example for the latter is Andrew's gfx1103, which reports
HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT = 'false'; I assume it
can still access all host memory, but I might be wrong.)

- Documenting how GCC handles this in libgomp.texi

BTW: See https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html
for how USM is detected for nvptx + gcn devices.

* * *

For completeness, I also intent to look at Andrew's pinned memory/(pseudo)USM
patches; they are useful but address other aspects as those listed above
→ https://gcc.gnu.org/pipermail/gcc-patches/2024-June/654331.html
→ https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652932.html

Tobias
Tobias Burnus Sept. 24, 2024, 3:50 p.m. UTC | #3
Now committed as r15-3836-g4cb20dc043cf70

Contrary to the originally posted patch, it also acts on the newer/newly 
added 'omp requires self_maps'.

In the area of (unified-)shared memory/self maps, the next step seems to 
be to do still mapping for static variables – before moving to 
refinements like how to handle implicit 'declare target' for static 
variables, …

For this piece of code, we also want to run it for APUs even when no USM 
has been requested, avoid adding those to the mapping table (for self 
maps) and do a more efficient mapping (e.g. memcpy or avoid multiple locks).

Tobias

Tobias Burnus wrote:
>
> short version: I think the patch as posted is fine and no action 
> beyond is needed for this one issue.
>
> See below for the long version.
>
> Possibly modifications (now or as follow up):
> - using memcpy + or let the plugin do it
> - not adding link variables to the splay tree with 'USM'.
>
> Thomas Schwinge wrote:
>>> Tested on x86-64-gnu-linux and nvptx offloading (that supports USM).
>> (I yet have to set up such a USM configuration...)
>
> You already used an USM config, e.g., when running gfx90a (likewise: 
> gfx90c), except that USM on mainline it currently only works if you 
> explicitly set 'export HSA_XNACK=1'.
>
> For Nvptx, you need a post-Volta GPU with the open-kernels driver, 
> which is for newer driver versions the default.
>
> * * *
>> Do I understand correctly that even if
>> 'GOMP_REQUIRES_UNIFIED_SHARED_MEMORY', we cannot just skip all the
>> 'mem_map' setup in 'gomp_load_image_to_device' etc., because we're not
>> (yet?) setting 'GOMP_OFFLOAD_CAP_SHARED_MEM'?
>
> We actually do set GOMP_OFFLOAD_CAP_SHARED_MEM with 'requires 
> unified_shared_memory'.
>
> But, indeed, we cannot skip the memory mapping parts – due to the way 
> we handle static variables.
>
> * * *
>
>>> +
>>> +      if (is_link_var
>>> +      && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
>>> +    gomp_copy_host2dev (devicep, NULL, (void *) target_var->start,
>>> +                &k->host_start, sizeof (void *), false, NULL);
>>>       }
>> Calling 'gomp_copy_host2dev' looks a bit funny given we've just
>> determined USM (..., but I'm not asking for plain 'memcpy').
>
> I guess a plain memcpy would do as well. [Assuming that the device's 
> static variable is host accessible, which it probably is and should be.]
>
> I add it to my to-do list for USM-related tasks to change this; 
> possibly moving it to the plugin side has some advantages? Possibly 
> not adding it to the splay tree if not needed. (Cf. below for env var 
> discussion.)
>
> Regarding the unload: For 'declare target link(A)', we have, e.g., 
> 'static int *A' on the device side. Thus, we could do 'A = NULL' – and 
> rather should do 'A = {clobber}', but that's rather pointless in 
> general and especially when unloading the image.
>
>> What's the advantage/rationale of doing this here vs. in
>> 'gomp_map_vars_internal' for 'REFCOUNT_LINK'?  (May be worth a source
>> code comment?)
>
> (A, B, C refers to the following example.)
>
> We don't see 'A' (or 'B') in the GOMP_target_ext call and thus not in 
> gomp_map_vars_internal.
>
> Besides: We only want to do the initialization once and not every time 
> gomp_map_vars_internal is called.
>
> I think the following program may help to understand the issue and the 
> patch better.
>
> Note: While A, B, C are 'int …[3]' on the host, on the device we only 
> have 'int B[3]' while for A it's 'int *A' and C only exists on the host.
>
>  * * *
>
> #pragma requires unified_shared_memory
>
> static int A[3], B[3], C[3];
> #pragma omp declare target link(A) enter(B)
>
> #pragma omp begin declare target
> void f(int *p)
> {
>    A[2] += B[2] + p[2];  // p points to the host's C variable
> }
> #pragma omp end declare target
>
> void foo(int dev) {
>   int *ptr = C;
>   #pragma omp target firstprivate(ptr) device(dev)
>     f (ptr);
> }
>
>
> * * *
>
> Here, 'ptr' (and thus 'p') point to the host 'C' variable, both before 
> the target
> region and inside the target region.
>
> 'B' points to the device local version of the variable.
>
> And 'A' on a non-host device is likely to be NULL ('static int *A' + 
> .BSS) before this patch.
> Or pointing to the host's 'A' with this patch.
>
> * * *
>
> With A pointing to the host version (and likewise 'p' pointing to the 
> host C), host fallback
> and device version yield identical result for 'A' and for 'C' (via 
> ptr/p). — However, 'B' on
> host and non-host device have nothing in common. While that might be 
> fine, in general it is not.
>
> Hence, in order to get for a .BSS valued 'B' the same result on host 
> and device, we need, e.g.
>
> #pragma omp data map(always: B) device(dev)
>   foo (dev);
>
> to call 'foo' to ensure that the two 'B' are in sync.
>
> * * *
>
> Code wise, this means that with GOMP_OFFLOAD_CAP_SHARED_MEM, we still 
> have
> to apply the map for 'declare target enter(…)' variables, except if host
> and device share the same code – but that should only be the case for
> host fallback (= initial device) and, possibly, 
> GOMP_OFFLOAD_CAP_NATIVE_EXEC.
>
> * * *
>
> NOTE: OpenMP still permits to honor explicit 'map' with 'requires 
> unified_shared_memory',
> only with 'self' maps, copying the data in 'map' is explicitly 
> disallowed.
>
> * * *
>
> This patch + honoring 'map' for static (non-'link'?) variables even with
> GOMP_OFFLOAD_CAP_SHARED_MEM where the main items for the USM follow-up 
> patches,
> I meant by "More USM cleanup/fixes/extensions to make it _more_ 
> useful" on slide 16
> of 
> https://gcc.gnu.org/wiki/cauldron2024#cauldron2024talks.openmp_openacc_and_offloading_in_gcc
>
> Plus, to go a bit beyond:
> - offering a flag to change 'declare target enter(…)' to 'link(…)'
>   [RFC: enable it by default for 'requires unified_shared_memory'?]
>
> - switching to GOMP_OFFLOAD_CAP_SHARED_MEM by default for APUs
>   (= same memory controller) for performance
>
> - Adding a GOMP_ environment variable to toggle between mapping vs. USM
>   access on systems not detected as being APUs. (That is: systems that
>   support USM but use an interconnect or page migration for the memory
>   access. Possibly, also overriding the USM detection for systems which
>   can access the host memory but due to some own memory are not 
> recognized
>   (→ device attributes) as being USM devices.
>   And possibly also forcing to honor explicit maps with requires
>
> (Example for the latter is Andrew's gfx1103, which reports
> HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT = 'false'; I assume it
> can still access all host memory, but I might be wrong.)
>
> - Documenting how GCC handles this in libgomp.texi
>
> BTW: See 
> https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html
> for how USM is detected for nvptx + gcn devices.
>
> * * *
>
> For completeness, I also intent to look at Andrew's pinned 
> memory/(pseudo)USM
> patches; they are useful but address other aspects as those listed above
> → https://gcc.gnu.org/pipermail/gcc-patches/2024-June/654331.html
> → https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652932.html
diff mbox series

Patch

libgomp: with USM, init 'link' variables with host address

If requires unified_shared_memory is set, make 'declare target link'
variables to point initially to the host pointer.

libgomp/ChangeLog:

	* target.c (gomp_load_image_to_device): For requires
	unified_shared_memory, update 'link' vars to point to the host var.
	* testsuite/libgomp.c-c++-common/target-link-3.c: New test.

 libgomp/target.c                                   |  5 +++
 .../testsuite/libgomp.c-c++-common/target-link-3.c | 52 ++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/libgomp/target.c b/libgomp/target.c
index 47ec36928a6..66b54fd2ab8 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2451,6 +2451,11 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
       array++;
+
+      if (is_link_var
+	  && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
+	gomp_copy_host2dev (devicep, NULL, (void *) target_var->start,
+			    &k->host_start, sizeof (void *), false, NULL);
     }
 
   /* Last entry is for the ICV struct variable; if absent, start = end = 0.  */
diff --git a/libgomp/testsuite/libgomp.c-c++-common/target-link-3.c b/libgomp/testsuite/libgomp.c-c++-common/target-link-3.c
new file mode 100644
index 00000000000..c707b38b7d4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/target-link-3.c
@@ -0,0 +1,52 @@ 
+/* { dg-do run }  */
+
+#include <stdint.h>
+#include <omp.h>
+
+#pragma omp requires unified_shared_memory
+
+int A[3] = {-3,-4,-5};
+static int q = -401;
+#pragma omp declare target link(A, q)
+
+#pragma omp begin declare target
+void
+f (uintptr_t *pA, uintptr_t *pq)
+{
+  if (A[0] != 1 || A[1] != 2 || A[2] != 3 || q != 42)
+    __builtin_abort ();
+  A[0] = 13;
+  A[1] = 14;
+  A[2] = 15;
+  q = 23;
+  *pA = (uintptr_t) &A[0];
+  *pq = (uintptr_t) &q;
+}
+#pragma omp end declare target
+
+int
+main ()
+{
+  uintptr_t hpA = (uintptr_t) &A[0];
+  uintptr_t hpq = (uintptr_t) &q;
+  uintptr_t dpA, dpq;
+
+  A[0] = 1;
+  A[1] = 2;
+  A[2] = 3;
+  q = 42;
+
+  for (int i = 0; i <= omp_get_num_devices (); ++i)
+    {
+      #pragma omp target device(device_num: i) map(dpA, dpq)
+	f (&dpA, &dpq);
+      if (hpA != dpA || hpq != dpq)
+	__builtin_abort ();
+      if (A[0] != 13 || A[1] != 14 || A[2] != 15 || q != 23)
+	__builtin_abort ();
+      A[0] = 1;
+      A[1] = 2;
+      A[2] = 3;
+      q = 42;
+    }
+}