
libgomp: Enable USM for some nvptx devices

Message ID f99dc63b-54de-4463-a704-0a5c10bf2067@baylibre.com
State New
Series: libgomp: Enable USM for some nvptx devices

Commit Message

Tobias Burnus May 28, 2024, 10:33 p.m. UTC
While most of the nvptx systems I have access to don't support
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
one does:

Tesla V100-SXM2-16GB (as installed, e.g., on ORNL's Summit) does support 
this feature. And with that feature, unified-shared memory support does 
work, presumably by handling automatic page migration when a page fault 
occurs.

Hence: Enable USM support for those. When doing so, all 'requires 
unified_shared_memory' tests of sollve_vv pass :-)

I am not quite sure whether there are unintended side effects, hence, I 
have not enabled support for it in general. In particular, 'declare 
target enter(global_var)' seems to be mishandled (I think it should be 
link + pointer updated to point to the host; cf. description for 
'self_maps'). Thus, it is not enabled by default but only when USM has 
been requested.
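
To illustrate that concern, a minimal sketch (mine, not part of the patch;
what the target region should see under USM is exactly the open question):

#pragma omp requires unified_shared_memory

int global_var = 42;
#pragma omp declare target enter(global_var)

int
main (void)
{
  global_var = 1;  /* Modified on the host only.  */
  int seen;
  #pragma omp target map(from: seen)
    seen = global_var;
  /* With a separate device copy, 'seen' is still 42; if the device
     instead accesses the host variable (link-like behavior), it is 1.  */
  return 0;
}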

OK for mainline?
Comments? Remarks? Suggestions?

Tobias

PS: I guess some more USM tests should be added…

Comments

Tobias Burnus May 29, 2024, 6:20 a.m. UTC | #1
Tobias Burnus wrote:
> While most of the nvptx systems I have access to don't have the 
> support for 
> CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, one 
> has:

Actually, CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS is sufficient. And 
I finally also found the proper webpage for this feature; I couldn't 
find it earlier because Nvidia's documentation uses pageableMemoryAccess 
rather than the CU_... name for this feature. The updated patch is attached.

For details: 
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
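
For reference, a standalone sketch (mine; compiled against the CUDA driver
API, e.g. 'gcc test.c -lcuda') that queries both attributes:

#include <stdio.h>
#include <cuda.h>

int
main (void)
{
  int ndevs;
  if (cuInit (0) != CUDA_SUCCESS || cuDeviceGetCount (&ndevs) != CUDA_SUCCESS)
    return 1;
  for (int dev = 0; dev < ndevs; dev++)
    {
      int pageable = 0, host_pt = 0;
      cuDeviceGetAttribute (&pageable,
			    CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS, dev);
      cuDeviceGetAttribute (&host_pt,
			    CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
			    dev);
      printf ("device %d: pageableMemoryAccess=%d usesHostPageTables=%d\n",
	      dev, pageable, host_pt);
    }
  return 0;
}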

In principle, this proper USM is supported by Grace Hopper and by POWER9 + 
Volta (sm_70) – but for some reason, our PPC/Volta system does not 
support it. It is also said to work with Turing (sm_75) and newer when 
using the Linux kernel's HMM and the Open Kernel Modules (newer CUDA 
releases ship these but don't use them by default). See the link above.

> I am not quite sure whether there are unintended side effects, hence, 
> I have not enabled support for it in general. In particular, 'declare 
> target enter(global_var)' seems to be mishandled (I think it should be 
> link + pointer updated to point to the host; cf. description for 
> 'self_maps'). Thus, it is not enabled by default but only when USM has 
> been requested.
OK for mainline?
Comments? Remarks? Suggestions?

Tobias
> PS: I guess some more USM tests should be added…
>
Jakub Jelinek May 29, 2024, 12:54 p.m. UTC | #2
On Wed, May 29, 2024 at 08:20:01AM +0200, Tobias Burnus wrote:
> +  if (num_devices > 0
> +      && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
> +    for (int dev = 0; dev < num_devices; dev++)
> +      {
> +	int pi;
> +	CUresult r;
> +	r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
> +	      CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS,
> +	      dev);

Formatting nit: the CU_DEVICE_... should be aligned below cuDeviceGetAttribute;
I think it fits like that (if it didn't, one could use a temporary
variable).

Otherwise LGTM.

	Jakub
Andrew Stubbs June 3, 2024, 3:09 p.m. UTC | #3
On 28/05/2024 23:33, Tobias Burnus wrote:
> While most of the nvptx systems I have access to don't have the support 
> for CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES, 
> one has:
> 
> Tesla V100-SXM2-16GB (as installed, e.g., on ORNL's Summit) does support 
> this feature. And with that feature, unified-shared memory support does 
> work, presumably by handling automatic page migration when a page fault 
> occurs.
> 
> Hence: Enable USM support for those. When doing so, all 'requires 
> unified_shared_memory' tests of sollve_vv pass :-)
> 
> I am not quite sure whether there are unintended side effects, hence, I 
> have not enabled support for it in general. In particular, 'declare 
> target enter(global_var)' seems to be mishandled (I think it should be 
> link + pointer updated to point to the host; cf. description for 
> 'self_maps'). Thus, it is not enabled by default but only when USM has 
> been requested.
> 
> OK for mainline?
> Comments? Remarks? Suggestions?
> 
> Tobias
> 
> PS: I guess some more USM tests should be added…
> 


> +		/* If USM has been requested and is supported by all devices
> +		   of this type, set the capability accordingly.  */
> +		if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
> +		  current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
> +

This breaks my USM patches that add the omp_alloc support (because it 
now short-circuits all of those code-paths), and it's just not true for 
devices where all host memory isn't magically addressable on the device.

Is there another way to detect truly shared memory?

Andrew
Tobias Burnus June 3, 2024, 4:46 p.m. UTC | #4
Andrew Stubbs wrote:
>> +        /* If USM has been requested and is supported by all devices
>> +           of this type, set the capability accordingly.  */
>> +        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
>> +          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
>> +
>
> This breaks my USM patches that add the omp_alloc support (because it 
> now short-circuits all of those code-paths),

which I believe is fine. Your USM patches are for pseudo-USM, i.e. a 
(useful) band-aid for systems where the memory is not truly 
unified shared memory but where only specially tagged host memory is 
device accessible (e.g. only memory allocated via cuMemAllocManaged) — and, 
quite similarly, for -foffload-memory=pinned.

I think if a user wants pseudo-USM – and requests it by passing 
-foffload-memory=unified – we can add another flag to the internal 
omp_requires_mask. By passing this option, a user should then also be 
aware of all the unavoidable special-case issues of pseudo-USM and 
cannot complain when running into them.

If not, well, then the user either gets true USM (if supported) - or 
host fallback. Either is perfectly fine.

With -foffload-memory=unified, the compiler can then add all the 
omp_alloc calls – and, e.g., set a new GOMP_REQUIRES_OFFLOAD_MANAGED 
flag. If that's set, we wouldn't do the above-quoted capability 
setting in libgomp/target.c.
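
In code, the gomp_target_init part would then look roughly like this
(sketch only; GOMP_REQUIRES_OFFLOAD_MANAGED is a name made up here, it
does not exist yet):

	/* If USM has been requested and is supported by all devices
	   of this type, set the capability accordingly - unless managed
	   pseudo-USM was requested via -foffload-memory=unified.  */
	if ((omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
	    && !(omp_requires_mask & GOMP_REQUIRES_OFFLOAD_MANAGED))
	  current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;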

For nvidia, GOMP_REQUIRES_OFFLOAD_MANAGED probably requires 
CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, i.e. when that is 0 we 
probably want to return -1 also for -foffload-memory=unified. A quick 
check shows that Tesla K20 (Kepler, sm_35) has 0 while Volta, Ada, 
Ampere (sm_70, sm_82, sm_89) have 1. (I recall using managed memory on 
an old system; page migration to the device worked fine, but host 
accesses while the kernel was still running crashed the program.)

For amdgcn, my impression is that we don't need to handle 
-foffload-memory=unified, as only the MI200 series (+ APUs) supports this 
well, but MI200 also supports true USM (with page migration; for APUs it 
makes even less sense). But, of course, we still may. — Auto-setting 
HSA_XNACK could still be done for MI200, but I wonder how to distinguish 
MI300X vs. MI300A; it probably doesn't harm (nor help) to set 
HSA_XNACK for APUs …


> and it's just not true for devices where all host memory isn't 
> magically addressable on the device.
> Is there another way to detect truly shared memory?

Do you have any indication that the current checks become true when the 
memory is not accessible?

Tobias
Andrew Stubbs June 3, 2024, 5:04 p.m. UTC | #5
On 03/06/2024 17:46, Tobias Burnus wrote:
> Andrew Stubbs wrote:
>>> +        /* If USM has been requested and is supported by all devices
>>> +           of this type, set the capability accordingly.  */
>>> +        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
>>> +          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
>>> +
>>
>> This breaks my USM patches that add the omp_alloc support (because it 
>> now short-circuits all of those code-paths),
> 
> which I believe is fine. Your USM patches are for pseudo-USM, i.e. a 
> (useful) bandaid for systems where the memory is not truely 
> unified-shared memory but only specially tagged host memory is device 
> accessible. (e.g. only memory allocated via cuMemAllocManaged) — And, 
> quite similar, for -foffload-memory=pinned.

Er, no.

The default do-nothing USM uses slow uncachable PCI memory accesses (on 
devices that don't have truly shared memory, like APUs).

The CUDA Managed Memory and AMD Coarse Grained memory implementation 
uses proper page migration and permits full-speed memory access on the 
device (just don't thrash the pages too fast).

These are very different things!

> I think if a user wants to have pseudo USM – and does so by passing 
> -foffload-memory=unified – we can add another flag to the internal 
> omp_requires_mask. - By passing this option, a user should then also be 
> aware of all the unavoidable special-case issues of pseudo-USM and 
> cannot complain if they run into those.
> 
> If not, well, then the user either gets true USM (if supported) - or 
> host fallback. Either of it is perfectly fine.
> 
> With -foffload-memory=unified, the compiler can then add all the 
> omp_alloc calls – and, e.g., set a new GOMP_REQUIRES_OFFLOAD_MANAGED 
> flag. If that's set, we wouldn't do the line above quoted capability 
> setting in libgomp/target.c.
> 
> For nvidia, GOMP_REQUIRES_OFFLOAD_MANAGED probably requires 
> CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, i.e. when 0 then we 
> probably want to return -1 also for -foffload-memory=unified. - A quick 
> check shows that Tesla K20 (Kepler, sm_35) has 0 while Volta, Ada, 
> Ampere (sm_70, sm_82, sm_89) have 1. (I recall using managed memory on 
> an old system; page migration to the device worked fine, but a on-host 
> accesses while the kernel was still running, crashed the program.|)
> |
> 
> For amdgcn, my impression is that we don't need to handle 
> -foffload-memory=unified as only the MI200 series (+ APUs) supports this 
> well, but MI200 also supports true USM (with page migration; for APU it 
> makes even less sense). - But, of course, we still may. — Auto-setting 
> HSA_XNACK could be still be done MI200, but I wonder how to distinguish 
> MI300X vs. MI300A, but it probably doesn't harm (nor help) to set 
> HSA_XNACK for APUs …
> 
> 
>> and it's just not true for devices where all host memory isn't 
>> magically addressable on the device.
>> Is there another way to detect truly shared memory?
> 
> Do you have any indication that the current checks become true when the 
> memory is not accessible?

On AMD MI200, your check broke my USM testcases (because the code they 
were testing isn't active).  This is a serious performance problem.

Andrew
Tobias Burnus June 3, 2024, 8:40 p.m. UTC | #6
Andrew Stubbs wrote:
> On 03/06/2024 17:46, Tobias Burnus wrote:
>> Andrew Stubbs wrote:
>>>> +        /* If USM has been requested and is supported by all devices
>>>> +           of this type, set the capability accordingly. */
>>>> +        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
>>>> +          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
>>>> +
>>>
>>> This breaks my USM patches that add the omp_alloc support (because 
>>> it now short-circuits all of those code-paths),
>>
>> which I believe is fine. Your USM patches are for pseudo-USM, i.e. a 
>> (useful) bandaid for systems where the memory is not truely 
>> unified-shared memory but only specially tagged host memory is device 
>> accessible. (e.g. only memory allocated via cuMemAllocManaged) — And, 
>> quite similar, for -foffload-memory=pinned.
>
> Er, no.
>
> The default do-nothing USM uses slow uncachable PCI memory accesses 
> (on devices that don't have truly shared memory, like APUs).

I have no idea what "default do-nothing USM" is – and using PCI-E 
to transfer the data is the only option unless there is either a common 
memory controller or some other interconnect (e.g. Infinity Fabric).

However, your description sounds as if you are talking about pinned memory – 
which by construction cannot migrate – and not about managed memory, 
which is one of the main approaches to USM – especially as that's how 
HMM works and as it avoids a remote transfer for every memory access.

If you use a Linux kernel with HMM and driver support for it, the default 
is that upon device access, the page migrates to the GPU (using, e.g., 
PCI-E) and then stays there until the host accesses that memory page 
again, triggering a page fault and a transfer back. That's the whole idea 
of HMM and works similarly to the migrate-to-disk feature (aka swapping), 
cf. https://docs.kernel.org/mm/hmm.html

That's the very same behavior as with hipMallocManaged with XNACK 
enabled according to 
https://rocm.docs.amd.com/en/develop/conceptual/gpu-memory.html
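
In OpenMP terms, that is the behavior a plain USM program relies on; a
minimal sketch (mine), assuming the kernel/driver actually migrate pages
as described:

#pragma omp requires unified_shared_memory
#include <stdlib.h>

int
main (void)
{
  int n = 1 << 20;
  int *a = malloc (n * sizeof (int));  /* Plain host memory, never mapped.  */
  for (int i = 0; i < n; i++)
    a[i] = i;

  /* The first device access faults the pages over to the GPU ...  */
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++)
    a[i] += 1;

  /* ... and the first host access afterwards migrates them back.  */
  int bad = (a[0] != 1);
  free (a);
  return bad;
}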

As PowerPC + Volta (+ normal kernel) does not support USM but a system 
with Nvlink does, I bet that on such a system the memory stays on the 
host and Nvlink does the remote access, but I don't know how Nvlink 
handles caching. (The feature flags state that direct host-memory access 
from the device is possible.)

By contrast, for my laptop GPU (Nvidia RTX A1000) with open kernel 
drivers + CUDA drivers, I bet the memory migration will happen – 
especially as the feature flags state that direct host-memory access is 
not possible.

* * *

If host and device access data on the same memory page, page migration 
back and forth will happen continuously, which is very slow.

Also slow is data that is spread over many pages, as one keeps getting 
page faults until the data is finally completely migrated. The solution 
in that case is a large page such that the data is transferred in 
one or a few large chunks.

In general, using manual allocation (x = omp_alloc(...)) with a suitable 
allocator can avoid the problem by using pinning or large pages or … 
Without knowing the algorithm it is hard to have a generic solution.

If such a concurrent-access issue occurs for compiler-generated code 
or within the run-time library, we should definitely try to fix it; 
for user code, it is probably hopeless in the generic case.

* * *

I actually tried to find an OpenMP target-offload benchmark, possibly 
for USM, but I failed. Most seem to be either not available or seriously 
broken – when testing starts by fixing OpenMP syntax bugs, it does not 
increase the trust in the testcase. — Can you suggest a testcase?

* * *

> The CUDA Managed Memory and AMD Coarse Grained memory implementation 
> uses proper page migration and permits full-speed memory access on the 
> device (just don't thrash the pages too fast).

As written, in my understanding that is what happens with HMM kernel 
support for any memory that is not explicitly pinned. The only extra 
trick an implementation can play is pinning the page – so that it 
knows the host memory does not change location (e.g. won't migrate to the 
other NUMA memory of the CPU or to swap space) and the memory can 
be accessed directly.

I am pretty sure that's the reason that, e.g., CUDA pinned memory is faster – 
and it might also help with HMM migration if the destination is known 
not to change; no idea whether the managed-memory routines play such 
tricks or not.

Another optimization opportunity exists if it is known that the memory 
won't be accessed by host until the kernel ends, but I don't see this 
guaranteed in general in user code.

* * *

> On AMD MI200, your check broken my USM testcases (because the code 
> they were testing isn't active).  This is a serious performance problem.

"I need more data." — First, a valid USM testcase should not be broken 
in the mainline. Secondly, I don't see how a generic testcase can have a 
performance issue when USM works. And, I didn't see a test fail on 
mainline when testing on an MI200 system and on Summit as PowerPC + 
Volta + Nvlink system. Admittedly, I have not run the full testsuite on 
my laptop, but I also didn't see issues when testing a subset.

Additionally, if a specific implementation is required, well, then we 
have two ways to ensure that it works: effective-target checking and 
command-line arguments. I have the feeling you would like to use 
-foffload-memory=managed (alias 'unified') for that testcase.

And finally, as I keep saying: I do believe that 
-foffload-memory=managed/pinned has its uses and should land in mainline. 
But it shouldn't be the default.

Tobias

PS: I would love to do some comparisons on, e.g., Summit, my laptop, 
MI210, MI250X of host execution vs. USM as implemented on mainline vs. 
-foffload-memory= managed USM – and, in principle, vs. mapping. But I 
first need to find a suitable benchmark which is somewhat compute heavy 
and doesn't only test data transfer (like BabelStream).
Andrew Stubbs June 4, 2024, 8:51 a.m. UTC | #7
On 03/06/2024 21:40, Tobias Burnus wrote:
> Andrew Stubbs wrote:
>> On 03/06/2024 17:46, Tobias Burnus wrote:
>>> Andrew Stubbs wrote:
>>>>> +        /* If USM has been requested and is supported by all devices
>>>>> +           of this type, set the capability accordingly. */
>>>>> +        if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
>>>>> +          current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
>>>>> +
>>>>
>>>> This breaks my USM patches that add the omp_alloc support (because 
>>>> it now short-circuits all of those code-paths),
>>>
>>> which I believe is fine. Your USM patches are for pseudo-USM, i.e. a 
>>> (useful) bandaid for systems where the memory is not truely 
>>> unified-shared memory but only specially tagged host memory is device 
>>> accessible. (e.g. only memory allocated via cuMemAllocManaged) — And, 
>>> quite similar, for -foffload-memory=pinned.
>>
>> Er, no.
>>
>> The default do-nothing USM uses slow uncachable PCI memory accesses 
>> (on devices that don't have truly shared memory, like APUs).
> 
> I have no idea what a "default do nothing USM" is – and using the PCI-E 
> to transfer the data is the only option unless there is either a common 
> memory controller or some other interconnect Infinity Fabric interconnect).

"Do nothing USM" is when you don't do anything special and expect it to 
Just Work. So, use plain malloc as usual, not Managed Memory.

AMD has "fine grained" and "coarse grained" memory. The default is fine 
grained (or completely unshared), and in that mode the GPU accesses host 
memory on demand, one load/store instruction at a time. It does not 
migrate those pages; they always live in host memory. These accesses are 
slow, but transfer less memory and don't incur the OS/driver overhead 
cost of a full page-miss exception (nor do they require XNACK aware 
code), but they can win for occasional access (such as loading initial 
kernel parameters).

Coarse grained memory is where it gets interesting for USM. Before USM, 
allocating coarse grained memory meant allocating device-side memory. 
After USM, with HSA_XNACK enabled, host-side pages can also be 
registered as coarse grained memory, and it's these pages that 
auto-migrate. *Only* these pages. This is what hipMallocManaged does, 
and this is what OG13 and my patches do.
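
At the HIP level, that boils down to roughly the following (sketch, mine,
compiled with hipcc; with HSA_XNACK=1 these pages migrate on first device
access):

#include <hip/hip_runtime_api.h>
#include <stdio.h>

int
main (void)
{
  double *a;
  /* Registers host-side pages as coarse-grained, migratable memory.  */
  if (hipMallocManaged ((void **) &a, 1024 * sizeof (double),
			hipMemAttachGlobal) != hipSuccess)
    {
      fprintf (stderr, "hipMallocManaged failed\n");
      return 1;
    }
  a[0] = 1.0;  /* Touched on the host first; a later GPU kernel accessing
		  a[] would fault the pages over.  */
  hipFree (a);
  return 0;
}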

> However, your description sounds as if you talk about pinned memory – 
> which by construction cannot migrate – and not about managed memory, 
> which is one of the main approaches for USM – especially as that's how 
> HMM works and as it avoids to transfer any memory access.

No, for NVidia we use Cuda Managed Memory, and for AMD we implement our 
own "libgomp managed memory".

> If you use a Linux kernel with HMM and have support for it, the default 
> is that upon device access, the page migrates to the GPU (using, e.g. 
> PCI-E) and then stays there until the host accesses that memory page 
> again, triggering a page fault and transfer back. That's the whole idea 
> of HMM and works similar to the migrate to disk feature (aka swapping), 
> cf. https://docs.kernel.org/mm/hmm.html

Nope, that's not the default on AMD. The fact that Cuda Managed Memory 
exists suggests it's also not the default there, but I'm not sure about 
that.

> That's the very same behavior as with hipMallocManaged with XNACK 
> enabled according to 
> https://rocm.docs.amd.com/en/develop/conceptual/gpu-memory.html

Only when you explicitly use hipMallocManaged.

> As PowerPC + Volta (+ normal kernel) does not support USM but a system 
> with + Nvlink does, I bet that on such a system, the memory stays on the 
> host and Nvlink does the remote access, but I don't know how Nvlink 
> handles caching. (The feature flags state that direct host-memory access 
> from the device is possible.)
> 
> By contrast, for my laptop GPU (Nvidia RTX A1000) with open kernel 
> drivers + CUDA drivers, I bet the memory migration will happen – 
> especially as the feature flags direct host-memory access is not possible
I'm not convinced, but the NVidia side of things is much less clear to me.

One thing I learned from the pinned memory experience is that Cuda runs 
faster if you use its APIs to manage memory.

> * * *
> 
> If host and device access data on the same memory page, page migration 
> forth and back will happen continuously, which is very slow.

Which is why the new version of my patches (which I plan to post soon, 
but this issue needs to be resolved first) is careful to keep migratable 
pages separated from the main heap. Unfortunately, "requires 
unified_shared_memory" is a blunt instrument and proper separation is 
generally impossible, but at least library data is separated (such as 
the HSA runtime!)

> Also slow is if data is spread over many pages as one gets keeps getting 
> page faults until the data is finally completely migrated. The solution 
> in that case is a large page such that the data is transferred in 
> one/few large chunks.

True, USM can rarely beat carefully planned explicit mappings (the 
exception perhaps being large quantities of sparsely used data).

> In general using manual allocation (x = omp_alloc(...)) with a suitable 
> allocator can manually avoid the problem by using pinning or large pages 
> or … Without knowing the algorithm it is hard to have a generic solution.
> 
> If there such a concurrent access issue occurs for compiler generated 
> code or with the run-time library, we should definitely try to fix it; 
> for user code, it is probably hopeless in the generic case.
> 
> * * *
> 
> I actually tried to find an OpenMP target-offload benchmark, possibly 
> for USM, but I failed. Most seem to be either not available or seriously 
> broken – when testing starts by fixing OpenMP syntax bugs, it does not 
> increase the trust in the testcase. — Can you suggest a testcase?

I don't have a good one, but a dumb memory copy should show the 
difference between fine and coarse grained memory.
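
Something like this trivial copy (sketch, mine; sizes arbitrary, timing via
omp_get_wtime) should do:

#pragma omp requires unified_shared_memory
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  long n = 1L << 25;
  double *src = malloc (n * sizeof (double));
  double *dst = malloc (n * sizeof (double));
  for (long i = 0; i < n; i++)
    src[i] = i;

  /* Plain host allocations accessed directly from the target region;
     whether pages migrate or are accessed remotely is what gets timed.  */
  double t = omp_get_wtime ();
  #pragma omp target teams distribute parallel for
  for (long i = 0; i < n; i++)
    dst[i] = src[i];
  printf ("copy took %f s\n", omp_get_wtime () - t);

  free (src);
  free (dst);
  return 0;
}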

> * * *
> 
>> The CUDA Managed Memory and AMD Coarse Grained memory implementation 
>> uses proper page migration and permits full-speed memory access on the 
>> device (just don't thrash the pages too fast).
> 
> As written, in my understanding that is what happens with HMM kernel 
> support for any memory that is not explicitly pinned. The only extra 
> trick an implementation can play is pinning the page – such that it 
> knows that the memory host does not change (e.g. won't migrates to the 
> other NUMA memory of the CPU or to swap space) such that the memory can 
> be directly accessed.
> 
> I am pretty sure that's the reason, e.g., CUDA pinned memory is faster – 
> and it might also help with HMM migration if the destination is known 
> not to change; no idea whether the managed memory routines play such 
> tricks or not.

I think I already explained this, for AMD at least.

> Another optimization opportunity exists if it is known that the memory 
> won't be accessed by host until the kernel ends, but I don't see this 
> guaranteed in general in user code.
> 
> * * *
> 
>> On AMD MI200, your check broken my USM testcases (because the code 
>> they were testing isn't active).  This is a serious performance problem.
> 
> "I need more data." — First, a valid USM testcase should not be broken 
> in the mainline. Secondly, I don't see how a generic testcase can have a 
> performance issue when USM works. And, I didn't see a test fail on 
> mainline when testing on an MI200 system and on Summit as PowerPC + 
> Volta + Nvlink system. Admittedly, I have not run the full testsuite on 
> my laptop, but I also didn't see issues when testing a subset.

The testcase concerns the new ompx_host_mem_alloc that is explicitly 
intended to allocate memory that is compatible with libraries using 
malloc when the compiler is intercepting malloc calls.

In this case an explicit mapping is supposed to work as if USM isn't 
enabled (unless the memory is truly shared), but your patch can't tell 
the difference between malloc-intercepted USM and true shared memory, so 
it disables the mapping for ompx_host_mem_space also.

> Additionally, If a specific implementation is required, well, then we 
> have two ways to ensure that it works: effective target checking and 
> using command line arguments. I have the feeling, you would like to use 
> -foffload-memory=managed (alias 'unified') for that testcase.

We could use "managed" but that's not the OpenMP term. The intended 
purpose of "-foffload-memory=unified" is to be *precisely* the same as 
"require unified_shared_memory" because it was perceived that USM could 
improve the performance of some benchmarks, but you're not allowed to 
modify the source code.

> And finally: As I keep telling: I do believe that 
> -foffload-memory=managed/pinned has its use and should land in mainline. 
> But that it shouldn't be the default.

No, it's absolutely not the default.

> 
> Tobias
> 
> PS: I would love to do some comparisons on, e.g., Summit, my laptop, 
> MI210, MI250X of host execution vs. USM as implemented on mainline vs. 
> -foffload-memory= managed USM – and, in principle, vs. mapping. But I 
> first need to find a suitable benchmark which is somewhat compute heavy 
> and doesn't only test data transfer (like BabelStream).

Actually, I think testing only data transfer is fine for this, but we 
might like to try some different access patterns, besides straight 
linear copies.

Andrew
Tobias Burnus June 4, 2024, 2:30 p.m. UTC | #8
Andrew Stubbs wrote:

>> PS: I would love to do some comparisons [...]
> Actually, I think testing only data transfer is fine for this, but we
> might like to try some different access patterns, besides straight
> linear copies.

I have now tried it on my laptop with BabelStream, https://github.com/UoB-HPC/BabelStream

Compiling with:
echo "#pragma omp requires unified_shared_memory" > omp-usm.h
cmake -DMODEL=omp -DCMAKE_CXX_COMPILER=$HOME/projects/gcc-trunk-offload/bin/g++ \
       -DCXX_EXTRA_FLAGS="-g -include ../omp-usm.h -foffload=nvptx-none -fopenmp" -DOFFLOAD=ON ..

(and the variants: no -include (→ map) + -DOFFLOAD=OFF (= host), and with host fallback
via an env var (or, for usm-14, due to lacking support).)

For mainline, I get (either with libgomp.so of mainline or GCC 14, i.e. w/o USM support):

host-14.log                     195.84user 0.94system 0 11.20elapsed 1755%CPU (0avgtext+0avgdata 1583268maxresident)k
host-mainline.log               200.16user 1.00system 0 11.89elapsed 1691%CPU (0avgtext+0avgdata 1583272maxresident)k
hostfallback-mainline.log       288.99user 4.57system 0 19.39elapsed 1513%CPU (0avgtext+0avgdata 1583972maxresident)k
usm-14.log                      279.91user 5.38system 0 19.57elapsed 1457%CPU (0avgtext+0avgdata 1590168maxresident)k
map-14.log                      4.17user 0.45system 0   03.58elapsed 129%CPU (0avgtext+0avgdata 1691152maxresident)k
map-mainline.log                4.15user 0.44system 0   03.58elapsed 128%CPU (0avgtext+0avgdata 1691260maxresident)k
usm-mainline.log                3.63user 1.96system 0   03.88elapsed 144%CPU (0avgtext+0avgdata 1692068maxresident)k

Thus: the GPU is faster than the host; host fallback takes 40% longer than the host-only build.
USM is 15% faster than mapping.


With OG13, the pattern is similar, except that USM is only 3% faster. Thus, HMM seems to win on my laptop.

host-og13.log                   191.51user 0.70system 0 09.80elapsed 1960%CPU (0avgtext+0avgdata 1583280maxresident)k
map-hostfallback-og13.log       205.12user 1.09system 0 10.82elapsed 1905%CPU (0avgtext+0avgdata 1585092maxresident)k
usm-hostfallback-og13.log       338.82user 4.60system 0 19.34elapsed 1775%CPU (0avgtext+0avgdata 1584580maxresident)k
map-og13.log                    4.43user 0.42system 0   03.59elapsed 135%CPU (0avgtext+0avgdata 1692692maxresident)k
usm-og13.log                    4.31user 1.18system 0   03.68elapsed 149%CPU (0avgtext+0avgdata 1686256maxresident)k

* * *

I planned to try an AMD Instinct MI200 device, but due to two IT issues, I cannot.
(Shutdown for maintenance of the MI250X system, and an NFS issue for the MI210 run
without being able to reboot due to the absence of a colleague who still has tons
of editors open.)

Tobias
Tobias Burnus June 5, 2024, 9:41 a.m. UTC | #9
Hi Andrew, hello world,

Now with AMD Instinct MI200 data - see below.

And a better look at the numbers: in terms of USM,
there does not seem to be a clear winner between the two
approaches. If we want to draw conclusions, definitely
more runs are needed (statistics):

The runs below show that the differences between runs
can be larger than the effect of mapping vs. USM.
And the MI210 result – OG13's USM was ~40% slower
(compared with mainline or OG13 'map') while
mainline's USM is about as fast as 'map' (OG13 or mainline) –
is not consistent with the MI250X result, where both USM
variants are slower, mainline's USM much more so (~30%)
than OG13's (~12%).



Tobias Burnus wrote:

> I have now tried it on my laptop with BabelStream,https://github.com/UoB-HPC/BabelStream
>
> Compiling with:
> echo "#pragma omp requires unified_shared_memory" > omp-usm.h
> cmake -DMODEL=omp -DCMAKE_CXX_COMPILER=$HOME/projects/gcc-trunk-offload/bin/g++ \
>        -DCXX_EXTRA_FLAGS="-g -include ../omp-usm.h -foffload=nvptx-none -fopenmp" -DOFFLOAD=ON ..
>
> (and the variants: no -include (→ map) + -DOFFLOAD=OFF (= host), and with hostfallback,
> via env var (or usm-14 by due to lacking support.)
>
> For mainline, I get (either with libgomp.so of mainline or GCC 14, i.e. w/o USM support):
> host-14.log                     195.84user 0.94system 0 11.20elapsed 1755%CPU (0avgtext+0avgdata 1583268maxresident)k
> host-mainline.log               200.16user 1.00system 0 11.89elapsed 1691%CPU (0avgtext+0avgdata 1583272maxresident)k
> hostfallback-mainline.log       288.99user 4.57system 0 19.39elapsed 1513%CPU (0avgtext+0avgdata 1583972maxresident)k
> usm-14.log                      279.91user 5.38system 0 19.57elapsed 1457%CPU (0avgtext+0avgdata 1590168maxresident)k
> map-14.log                      4.17user 0.45system 0   03.58elapsed 129%CPU (0avgtext+0avgdata 1691152maxresident)k
> map-mainline.log                4.15user 0.44system 0   03.58elapsed 128%CPU (0avgtext+0avgdata 1691260maxresident)k
> usm-mainline.log                3.63user 1.96system 0   03.88elapsed 144%CPU (0avgtext+0avgdata 1692068maxresident)k
>
> Thus: GPU is faster than host, host fallback takes 40% longer than doing host compilation.
> USM is 15% faster than mapping.

Correction: I shouldn't look at user time but at elapsed time. For the 
latter, USM is 8% slower on mainline; hostfallback is ~70% slower than 
host execution.

> With OG13, the pattern is similar, except that USM is only 3% faster.
Here, USM (elapsed) is 2.5% faster. It is a bit difficult to compare the 
results as OG13 is faster for both mapping and USM, which makes it hard 
to separate OG13-vs-mainline performance from the two different USM 
approaches.
> host-og13.log                   191.51user 0.70system 0 09.80elapsed 1960%CPU (0avgtext+0avgdata 1583280maxresident)k
> map-hostfallback-og13.log       205.12user 1.09system 0 10.82elapsed 1905%CPU (0avgtext+0avgdata 1585092maxresident)k
> usm-hostfallback-og13.log       338.82user 4.60system 0 19.34elapsed 1775%CPU (0avgtext+0avgdata 1584580maxresident)k
> map-og13.log                    4.43user 0.42system 0   03.59elapsed 135%CPU (0avgtext+0avgdata 1692692maxresident)k
> usm-og13.log                    4.31user 1.18system 0   03.68elapsed 149%CPU (0avgtext+0avgdata 1686256maxresident)k
>
> * * *

As IT issues are now solved:

(A) On  AMD Instinct MI210 (gfx90a)

The host fallback is very slow here, with an elapsed time of 24s vs. 1.6s for host execution.
map and USM seem to be in the same ballpark.
For two 'map' runs, I see a difference of 8%; the USM times are between those map results.

I see similar results for OG13 as for mainline, except for USM, which is ~40% slower (elapsed time)
than map (OG13 or mainline - or mainline's USM).

host-mainline-2.log             194.00user 7.21system 0 01.44elapsed 13954%CPU (0avgtext+0avgdata 1320960maxresident)k
host-mainline.log               221.53user 5.58system 0 01.78elapsed 12716%CPU (0avgtext+0avgdata 1318912maxresident)k
hostfallback-mainline-1.log     3073.35user 146.22system 0      24.25elapsed 13272%CPU (0avgtext+0avgdata 1644544maxresident)k
hostfallback-mainline-2.log     2268.62user 146.13system 0      23.39elapsed 10320%CPU (0avgtext+0avgdata 1650544maxresident)k
map-mainline-1.log              5.38user 16.16system 0  03.00elapsed 716%CPU (0avgtext+0avgdata 1714936maxresident)k
map-mainline-2.log              5.12user 15.93system 0  02.74elapsed 768%CPU (0avgtext+0avgdata 1714932maxresident)k
usm-mainline-1.log              7.61user 2.30system 0   02.89elapsed 342%CPU (0avgtext+0avgdata 1716984maxresident)k
usm-mainline-2.log              7.75user 2.92system 0   02.89elapsed 369%CPU (0avgtext+0avgdata 1716980maxresident)k

host-og13-1.log                 213.69user 6.37system 0 01.56elapsed 14026%CPU (0avgtext+0avgdata 1316864maxresident)k
hostfallback-map-og13-1.log     3026.68user 123.77system 0      23.69elapsed 13295%CPU (0avgtext+0avgdata 1642496maxresident)k
hostfallback-map-og13-2.log     3118.71user 123.81system 0      24.49elapsed 13235%CPU (0avgtext+0avgdata 1628160maxresident)k
hostfallback-usm-og13-1.log     3070.33user 116.23system 0      23.86elapsed 13354%CPU (0avgtext+0avgdata 1648632maxresident)k
hostfallback-usm-og13-2.log     3112.34user 125.54system 0      24.39elapsed 13273%CPU (0avgtext+0avgdata 1622012maxresident)k
map-og13-1.log                  5.61user 7.13system 0   02.69elapsed 472%CPU (0avgtext+0avgdata 1716984maxresident)k
map-og13-2.log                  5.39user 16.25system 0  02.83elapsed 764%CPU (0avgtext+0avgdata 1716984maxresident)k
usm-og13-1.log                  7.23user 3.13system 0   04.37elapsed 237%CPU (0avgtext+0avgdata 1716964maxresident)k
usm-og13-2.log                  7.31user 3.15system 0   03.98elapsed 262%CPU (0avgtext+0avgdata 1716964maxresident)k

* * *

Running it on MI250X:

USM is in the same ballpark as map – but here USM is actually 30% or 12% slower than map.

omp-stream-mainline-map
7.24user 0.71system 0:01.18elapsed 672%CPU (0avgtext+0avgdata 1728852maxresident)k

omp-stream-mainline-usm
2.48user 1.07system 0:01.44elapsed 247%CPU (0avgtext+0avgdata 1728916maxresident)k

omp-stream-og13-map
7.14user 0.72system 0:01.10elapsed 712%CPU (0avgtext+0avgdata 1728708maxresident)k

omp-stream-og13-usm
2.32user 0.91system 0:01.23elapsed 262%CPU (0avgtext+0avgdata 1991180maxresident)k


Tobias

Patch

libgomp: Enable USM for some nvptx devices

A few high-end nvptx devices support the attribute
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES;
for those, unified shared memory is supported in hardware. This
patch enables support for those - if all installed nvptx devices
have this feature (as the capabilities are per device type).

This exposes a bug in gomp_copy_back_icvs, which previously used
omp_get_mapped_ptr to find mapped variables; that function returns
the unchanged pointer in the case of shared memory. But in this case,
we have a few actually mapped pointers - like the ICV variables.
Additionally, there was a mismatch with regard to '-1' for the
device number, as gomp_copy_back_icvs and omp_get_mapped_ptr count
differently. Hence, do the lookup manually.

include/ChangeLog:

	* cuda/cuda.h
	(CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES):
	Add.

libgomp/ChangeLog:

	* libgomp.texi (nvptx): Update USM description.
	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices):
	Claim support when requesting USM and all devices support 
	CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES.
	* target.c (gomp_copy_back_icvs): Fix device ptr lookup.
	(gomp_target_init): Set GOMP_OFFLOAD_CAP_SHARED_MEM if the
	devices support USM.

 include/cuda/cuda.h           |  3 ++-
 libgomp/libgomp.texi          |  5 ++++-
 libgomp/plugin/plugin-nvptx.c | 15 +++++++++++++++
 libgomp/target.c              | 24 +++++++++++++++++++++++-
 4 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 0dca4b3a5c0..db640d20366 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -83,7 +83,8 @@  typedef enum {
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
   CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES = 100
 } CUdevice_attribute;
 
 enum {
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 71d62105a20..e0d37f67983 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -6435,7 +6435,10 @@  The implementation remark:
       the next reverse offload region is only executed after the previous
       one returned.
 @item OpenMP code that has a @code{requires} directive with
-      @code{unified_shared_memory} will remove any nvptx device from the
+      @code{unified_shared_memory} will run on nvptx devices if and only if
+      all of those support the
+      @code{CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES}
+      attribute; otherwise, all nvptx devices are removed from the
       list of available devices (``host fallback'').
 @item The default per-warp stack size is 128 kiB; see also @code{-msoft-stack}
       in the GCC manual.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5aad3448a8d..c4b0f5dd4bf 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1201,8 +1201,23 @@  GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
   if (num_devices > 0
       && ((omp_requires_mask
 	   & ~(GOMP_REQUIRES_UNIFIED_ADDRESS
+	       | GOMP_REQUIRES_UNIFIED_SHARED_MEMORY
 	       | GOMP_REQUIRES_REVERSE_OFFLOAD)) != 0))
     return -1;
+  /* Check whether automatic page migration is supported; if so, enable USM.
+     Currently, capabilities is per device type, hence, check all devices.  */
+  if (num_devices > 0
+      && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
+    for (int dev = 0; dev < num_devices; dev++)
+      {
+	int pi;
+	CUresult r;
+	r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+	      CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
+	      dev);
+	if (r != CUDA_SUCCESS || pi == 0)
+	  return -1;
+      }
   return num_devices;
 }
 
diff --git a/libgomp/target.c b/libgomp/target.c
index 5ec19ae489e..48689920d4a 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2969,8 +2969,25 @@  gomp_copy_back_icvs (struct gomp_device_descr *devicep, int device)
   if (item == NULL)
     return;
 
+  gomp_mutex_lock (&devicep->lock);
+
+  struct splay_tree_s *mem_map = &devicep->mem_map;
+  struct splay_tree_key_s cur_node;
+  void *dev_ptr = NULL;
+
   void *host_ptr = &item->icvs;
-  void *dev_ptr = omp_get_mapped_ptr (host_ptr, device);
+  cur_node.host_start = (uintptr_t) host_ptr;
+  cur_node.host_end = cur_node.host_start;
+  splay_tree_key n = gomp_map_0len_lookup (mem_map, &cur_node);
+
+  if (n)
+    {
+      uintptr_t offset = cur_node.host_start - n->host_start;
+      dev_ptr = (void *) (n->tgt->tgt_start + n->tgt_offset + offset);
+    }
+
+  gomp_mutex_unlock (&devicep->lock);
+
   if (dev_ptr != NULL)
     gomp_copy_dev2host (devicep, NULL, host_ptr, dev_ptr,
 			sizeof (struct gomp_offload_icvs));
@@ -5303,6 +5320,11 @@  gomp_target_init (void)
 	      {
 		/* Augment DEVICES and NUM_DEVICES.  */
 
+		/* If USM has been requested and is supported by all devices
+		   of this type, set the capability accordingly.  */
+		if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+		  current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+
 		devs = realloc (devs, (num_devs + new_num_devs)
 				      * sizeof (struct gomp_device_descr));
 		if (!devs)