Message ID | 20201026141448.109041-1-julian@codesourcery.com
---|---
State | New
Series | nvptx: Cache stacks block for OpenMP kernel launch
On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> This patch adds caching for the stack block allocated for offloaded
> OpenMP kernel launches on NVPTX. This is a performance optimisation --
> we observed an average 11% or so performance improvement with this patch
> across a set of accelerated GPU benchmarks on one machine (results vary
> according to individual benchmark and with hardware used).
>
> A given kernel launch will reuse the stack block from the previous launch
> if it is large enough, else it is freed and reallocated. A slight caveat
> is that memory will not be freed until the device is closed, so e.g. if
> code is using highly variable launch geometries and large amounts of
> GPU RAM, you might run out of resources slightly quicker with this patch.
>
> Another way this patch gains performance is by omitting the
> synchronisation at the end of an OpenMP offload kernel launch -- it's
> safe for the GPU and CPU to continue executing in parallel at that point,
> because e.g. copies-back from the device will be synchronised properly
> with kernel completion anyway.
>
> In turn, the last part necessitates a change to the way "(perhaps abort
> was called)" errors are detected and reported.
>
> Tested with offloading to NVPTX. OK for mainline?

I'm afraid I don't know the plugin nor CUDA well enough to review this
properly (therefore I'd like to hear from Thomas, Tom and/or Alexander).
Anyway, just two questions: wouldn't it make sense to add some upper
bound limit over which it wouldn't cache the stacks, so that it would
cache most of the time for normal programs, but one really excessive
kernel followed by many normal ones wouldn't result in memory
allocation failures?

And, in which context are cuStreamAddCallback-registered callbacks run?
E.g. if it is inside of an asynchronous interrupt, using locking in
there might not be the best thing to do.

> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> -  if (r == CUDA_ERROR_LAUNCH_FAILED)
> -    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
> -                       maybe_abort_msg);
> -  else if (r != CUDA_SUCCESS)
> -    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
> +                    (void *) ptx_dev, 0);
>  }
>
>  /* TODO: Implement GOMP_OFFLOAD_async_run.  */
> --
> 2.28.0

	Jakub
(Apologies if threading is broken -- for some reason I didn't receive
this reply directly!)

On Mon Oct 26 14:26:34 GMT 2020, Jakub Jelinek wrote:
> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for
> > offloaded OpenMP kernel launches on NVPTX. This is a performance
> > optimisation -- we observed an average 11% or so performance
> > improvement with this patch across a set of accelerated GPU
> > benchmarks on one machine (results vary according to individual
> > benchmark and with hardware used).
> >
> > A given kernel launch will reuse the stack block from the previous
> > launch if it is large enough, else it is freed and reallocated. A
> > slight caveat is that memory will not be freed until the device is
> > closed, so e.g. if code is using highly variable launch geometries
> > and large amounts of GPU RAM, you might run out of resources
> > slightly quicker with this patch.
> >
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch --
> > it's safe for the GPU and CPU to continue executing in parallel at
> > that point, because e.g. copies-back from the device will be
> > synchronised properly with kernel completion anyway.
> >
> > In turn, the last part necessitates a change to the way "(perhaps
> > abort was called)" errors are detected and reported.
> >
> > Tested with offloading to NVPTX. OK for mainline?
>
> I'm afraid I don't know the plugin nor CUDA well enough to review
> this properly (therefore I'd like to hear from Thomas, Tom and/or
> Alexander). Anyway, just two questions: wouldn't it make sense to
> add some upper bound limit over which it wouldn't cache the stacks,
> so that it would cache most of the time for normal programs, but one
> really excessive kernel followed by many normal ones wouldn't result
> in memory allocation failures?

Yes, that might work -- another idea is to free the stacks then retry
if a memory allocation fails, though that might lead to worse
fragmentation, perhaps. For the upper bound idea we'd need to pick a
sensible maximum limit. Something like 16MB maybe? Or user-controllable,
or some fraction of the GPU's total memory?

> And, in which context are cuStreamAddCallback-registered callbacks
> run? E.g. if it is inside of an asynchronous interrupt, using
> locking in there might not be the best thing to do.

The cuStreamAddCallback API is documented here:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483

We're quite limited in what we can do in the callback function, since
"Callbacks must not make any CUDA API calls". So what *can* a callback
function do? It is mentioned that the callback function's execution
will "pause" the stream it is logically running on. So can we get
deadlock, e.g. if multiple host threads are launching offload kernels
simultaneously? I don't think so, but I don't know how to prove it!

Thanks,

Julian
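[For concreteness, a minimal sketch of the callback pattern under
discussion -- host-only work registered to run once the stream drains --
might look like the following. The function and lock names here are
illustrative placeholders, not the patch's actual code; only
cuStreamAddCallback, CUDA_CALL_ASSERT, GOMP_PLUGIN_fatal and cuda_error
are real plugin/driver-API names.]

    /* Sketch only: a cuStreamAddCallback callback must not call back
       into CUDA, so it is restricted to host-side work -- here,
       releasing a mutex that guards a cached device allocation.  */

    static void
    example_stacks_done (CUstream stream, CUresult status, void *user_data)
    {
      pthread_mutex_t *lock = (pthread_mutex_t *) user_data;
      if (status != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("kernel execution failed: %s",
                           cuda_error (status));
      pthread_mutex_unlock (lock);
    }

    /* ...after cuLaunchKernel on the default (NULL) stream: the
       callback runs once all previously-queued work on that stream has
       completed.  */
    CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, example_stacks_done,
                      (void *) &stacks_lock, 0);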
On 2020/10/27 9:17 PM, Julian Brown wrote:
>> And, in which context are cuStreamAddCallback-registered callbacks
>> run? E.g. if it is inside of an asynchronous interrupt, using
>> locking in there might not be the best thing to do.
>
> The cuStreamAddCallback API is documented here:
>
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
>
> We're quite limited in what we can do in the callback function, since
> "Callbacks must not make any CUDA API calls". So what *can* a
> callback function do? It is mentioned that the callback function's
> execution will "pause" the stream it is logically running on. So can
> we get deadlock, e.g. if multiple host threads are launching offload
> kernels simultaneously? I don't think so, but I don't know how to
> prove it!

I think it's not deadlock that's a problem here, but that the lock
acquisition in nvptx_stack_acquire will effectively serialize GPU
kernel execution to just one host thread (since you're holding it till
kernel completion). Also, in that case, why do you need to use a CUDA
callback? You could just call the unlock directly afterwards.

I think a better way is to use a list of stack blocks in ptx_dev, and
quickly retrieve/unlock it in nvptx_stack_acquire, like how we did it
in GOMP_OFFLOAD_alloc for general device memory allocation.

Chung-Lin
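[A rough sketch of the free-list approach Chung-Lin describes, by
analogy with the free_blocks list that GOMP_OFFLOAD_alloc maintains,
might be as follows; the stack_list and stack_list_lock fields are
hypothetical, not existing members of struct ptx_device.]

    /* Hypothetical sketch: per-device list of cached stack blocks,
       searched under a briefly-held lock, so the lock never spans a
       kernel launch.  */

    struct stack_block
    {
      CUdeviceptr ptr;
      size_t size;
      struct stack_block *next;
    };

    static CUdeviceptr
    example_stack_acquire (struct ptx_device *ptx_dev, size_t needed)
    {
      CUdeviceptr ret = 0;
      pthread_mutex_lock (&ptx_dev->stack_list_lock);
      for (struct stack_block **p = &ptx_dev->stack_list; *p;
           p = &(*p)->next)
        if ((*p)->size >= needed)
          {
            struct stack_block *b = *p;
            *p = b->next;  /* Unlink: the block now belongs to this
                              launch.  */
            ret = b->ptr;
            free (b);
            break;
          }
      pthread_mutex_unlock (&ptx_dev->stack_list_lock);
      if (!ret)
        /* Nothing big enough cached: fall back to a fresh allocation.  */
        CUDA_CALL_ASSERT (cuMemAlloc, &ret, needed);
      return ret;
    }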
On Wed, 28 Oct 2020 15:25:56 +0800
Chung-Lin Tang <cltang@codesourcery.com> wrote:

> On 2020/10/27 9:17 PM, Julian Brown wrote:
> > [...] So can we get deadlock, e.g. if multiple host threads are
> > launching offload kernels simultaneously? I don't think so, but I
> > don't know how to prove it!
>
> I think it's not deadlock that's a problem here, but that the lock
> acquisition in nvptx_stack_acquire will effectively serialize GPU
> kernel execution to just one host thread (since you're holding it
> till kernel completion). Also, in that case, why do you need to use
> a CUDA callback? You could just call the unlock directly afterwards.

IIUC, there's a single GPU queue used for synchronous launches no
matter which host thread initiates the operation, and kernel execution
is serialised anyway, so that shouldn't be a problem. The only way to
get different kernels executing simultaneously is to use different
CUDA streams -- but I think that's still TBD for OpenMP ("TODO:
Implement GOMP_OFFLOAD_async_run").

> I think a better way is to use a list of stack blocks in ptx_dev,
> and quickly retrieve/unlock it in nvptx_stack_acquire, like how we
> did it in GOMP_OFFLOAD_alloc for general device memory allocation.

If it weren't for the serialisation, we could also keep a stack cache
per host thread in nvptx_thread. But as it is, I don't think we need
the extra complication. When we do OpenMP async support, maybe a stack
cache can be put per-stream in goacc_asyncqueue or the OpenMP
equivalent.

Thanks,

Julian
On Mon, 26 Oct 2020, Jakub Jelinek wrote:

> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation --
> > we observed an average 11% or so performance improvement with this patch
> > across a set of accelerated GPU benchmarks on one machine (results vary
> > according to individual benchmark and with hardware used).

In this patch you're folding two changes together: reuse of allocated
stacks and removing one host-device synchronization. Why is that? Can
you report performance change separately for each change (and split
out the patches)?

> > A given kernel launch will reuse the stack block from the previous launch
> > if it is large enough, else it is freed and reallocated. A slight caveat
> > is that memory will not be freed until the device is closed, so e.g. if
> > code is using highly variable launch geometries and large amounts of
> > GPU RAM, you might run out of resources slightly quicker with this patch.
> >
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch -- it's
> > safe for the GPU and CPU to continue executing in parallel at that point,
> > because e.g. copies-back from the device will be synchronised properly
> > with kernel completion anyway.

I don't think this explanation is sufficient. My understanding is that
OpenMP forbids the host to proceed asynchronously after the target
construct unless it is a 'target nowait' construct. This may be
observable if there's a printf in the target region, for example (or
if it accesses memory via host pointers).

So this really needs to be a separate patch with more explanation of
why this is okay (if it is okay).

> > In turn, the last part necessitates a change to the way "(perhaps abort
> > was called)" errors are detected and reported.

As already mentioned, using callbacks is problematic. Plus, I'm sure
the way you lock out other threads is a performance loss when multiple
threads have target regions: even though they will not run
concurrently on the GPU, you still want to allow host threads to
submit GPU jobs while the GPU is occupied.

I would suggest having a small pool (up to 3 entries, perhaps) of
stacks. Then you can arrange reuse without totally serializing host
threads on target regions.

Alexander
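[One way Alexander's suggestion might be realised -- purely a sketch,
with invented names and the pool size taken from his "up to 3 entries
perhaps", not anything from an actual patch -- is a fixed array of
claimable entries:]

    #define STACK_POOL_ENTRIES 3

    struct stack_pool_entry
    {
      CUdeviceptr ptr;
      size_t size;
      int in_use;
    };

    struct stack_pool
    {
      pthread_mutex_t lock;  /* Held only while claiming/returning.  */
      struct stack_pool_entry entry[STACK_POOL_ENTRIES];
    };

    /* Claim a pool entry.  The pool lock protects only the in_use
       flags, so host threads are not serialized across kernel
       execution.  */
    static struct stack_pool_entry *
    example_pool_claim (struct stack_pool *pool, size_t needed)
    {
      struct stack_pool_entry *e = NULL;
      pthread_mutex_lock (&pool->lock);
      for (int i = 0; i < STACK_POOL_ENTRIES; i++)
        if (!pool->entry[i].in_use)
          {
            e = &pool->entry[i];
            e->in_use = 1;
            break;
          }
      pthread_mutex_unlock (&pool->lock);
      if (e && e->size < needed)
        {
          /* Grow the claimed entry; no lock needed while we own it.  */
          if (e->ptr)
            CUDA_CALL_ASSERT (cuMemFree, e->ptr);
          CUDA_CALL_ASSERT (cuMemAlloc, &e->ptr, needed);
          e->size = needed;
        }
      return e;  /* NULL: pool exhausted, fall back to a one-off
                    allocation.  */
    }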
Hi Alexander,

Thanks for the review! Comments below.

On Tue, 10 Nov 2020 00:32:36 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> In this patch you're folding two changes together: reuse of allocated
> stacks and removing one host-device synchronization. Why is that?
> Can you report performance change separately for each change (and
> split out the patches)?

An accident of the development process of the patch, really -- the
idea for removing the post-kernel-launch synchronisation came from the
OpenACC side, and adapting it to OpenMP meant the stacks had to remain
allocated after the return of the GOMP_OFFLOAD_run function.

> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch [...]
>
> I don't think this explanation is sufficient. My understanding is
> that OpenMP forbids the host to proceed asynchronously after the
> target construct unless it is a 'target nowait' construct. This may
> be observable if there's a printf in the target region, for example
> (or if it accesses memory via host pointers).
>
> So this really needs to be a separate patch with more explanation of
> why this is okay (if it is okay).

As long as the offload kernel only touches GPU memory and does not
have any CPU-visible side effects (like the printf you mentioned -- I
hadn't really considered that, oops!), it's probably OK.

But anyway, the benefit obtained on OpenMP code (the same set of
benchmarks run before) of omitting the synchronisation at the end of
GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the
stacks caching, and miss out the synchronisation removal for now. (It
might still be something worth considering later, perhaps, as long as
we can show some given kernel doesn't use printf or access memory via
host pointers -- I guess the former might be easier than the latter. I
have observed the equivalent OpenACC patch provide a significant boost
on some benchmarks, so there's probably something that could be gained
on the OpenMP side too.)

The benefit with the attached patch -- just stacks caching, no
synchronisation removal -- is about 12% on the same set of benchmarks
as before. Results are a little noisy on the machine I'm benchmarking
on, so this isn't necessarily proof that the synchronisation removal
is harmful for performance!

> As already mentioned, using callbacks is problematic. Plus, I'm sure
> the way you lock out other threads is a performance loss when
> multiple threads have target regions: even though they will not run
> concurrently on the GPU, you still want to allow host threads to
> submit GPU jobs while the GPU is occupied.
>
> I would suggest having a small pool (up to 3 entries, perhaps) of
> stacks. Then you can arrange reuse without totally serializing host
> threads on target regions.

I'm really wary of the additional complexity of adding a stack pool,
and the memory allocation/freeing code paths in CUDA appear to be so
slow that we get a benefit with this patch even when the GPU stream
has to wait for the CPU to unlock the stacks block. Also, for large
GPU launches, the size of the soft-stacks block isn't really trivial
(I've seen something like 50MB on the hardware I'm using, with default
options), and multiplying that by 3 could start to eat into the GPU
heap memory for "useful data" quite significantly.

Consider the attached (probably not amazingly-written) microbenchmark.
It spawns 8 threads which each launch lots of OpenMP kernels
performing some trivial work, then joins the threads and checks the
results. As a baseline, with the "FEWER_KERNELS" parameters set (256
kernel launches over 8 threads), this gives us over 5 runs:

real    3m55.375s   user    7m14.192s   sys    0m30.148s
real    3m54.487s   user    7m6.775s    sys    0m34.678s
real    3m54.633s   user    7m20.381s   sys    0m30.620s
real    3m54.992s   user    7m12.464s   sys    0m29.610s
real    3m55.471s   user    7m14.342s   sys    0m29.815s

With a version of the attached patch, we instead get:

real    3m53.404s   user    3m39.869s   sys    0m16.149s
real    3m54.713s   user    3m41.018s   sys    0m16.129s
real    3m55.242s   user    3m55.148s   sys    0m17.130s
real    3m55.374s   user    3m40.411s   sys    0m15.818s
real    3m55.189s   user    3m40.144s   sys    0m15.846s

That is: real time is about the same, but user/sys time are reduced.

Without FEWER_KERNELS (1048576 kernel launches over 8 threads), the
baseline is:

real    12m29.975s  user    24m2.244s   sys    8m8.153s
real    12m15.391s  user    23m51.018s  sys    8m0.809s
real    12m5.424s   user    23m38.585s  sys    7m47.714s
real    12m10.456s  user    23m51.691s  sys    7m54.324s
real    12m37.735s  user    24m19.671s  sys    8m15.752s

And with the patch, we get:

real    4m42.600s   user    16m14.593s  sys    0m40.444s
real    4m43.579s   user    15m33.805s  sys    0m38.537s
real    4m42.211s   user    16m32.926s  sys    0m40.271s
real    4m44.256s   user    15m49.290s  sys    0m39.116s
real    4m42.013s   user    15m39.447s  sys    0m38.517s

Real, user and sys time are all dramatically less. So I'd suggest that
the attached patch is an improvement over the status quo, even if we
could experiment with the stacks pool idea as a further improvement
later on.

The attached patch also implements a size limit for retention of the
soft-stacks block -- freeing it before allocating more memory, rather
than at the start of a kernel launch, so bigger blocks can still be
shared between kernel launches if there's no memory allocation between
them. It also tries freeing smaller cached soft-stack blocks and
retrying memory allocation in out-of-memory situations.

Re-tested with offloading to NVPTX. OK for trunk?

Thanks,

Julian

ChangeLog

2020-11-13  Julian Brown  <julian@codesourcery.com>

libgomp/
	* plugin/plugin-nvptx.c (SOFTSTACK_CACHE_LIMIT): New define.
	(struct ptx_device): Add omp_stacks struct.
	(nvptx_open_device): Initialise cached-stacks housekeeping info.
	(nvptx_close_device): Free cached stacks block and mutex.
	(nvptx_stacks_free): New function.
	(nvptx_alloc): Add SUPPRESS_ERRORS parameter.
	(GOMP_OFFLOAD_alloc): Add strategies for freeing soft-stacks
	block.
	(nvptx_stacks_alloc): Rename to...
	(nvptx_stacks_acquire): This.  Cache stacks block between runs if
	same size or smaller is required.
	(nvptx_stacks_free): Remove.
	(GOMP_OFFLOAD_run): Call nvptx_stacks_acquire and lock stacks
	block during kernel execution.
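[The microbenchmark referred to above was an attachment and is not
preserved in this archive. A rough reconstruction of the shape
described -- 8 host threads each performing many synchronous target
launches of trivial work, then a join and a check -- might look like
the following; the per-thread launch count follows the quoted "256
kernel launches over 8 threads" configuration, but the array size and
kernel body are guesses:]

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 8
    #define LAUNCHES_PER_THREAD 32  /* 256 launches over 8 threads.  */
    #define N (1 << 20)

    static int data[NTHREADS][N];

    static void *
    thread_fn (void *arg)
    {
      int t = (int) (long) arg;
      int *a = data[t];
      for (int launch = 0; launch < LAUNCHES_PER_THREAD; launch++)
        {
          /* Each iteration is one synchronous offloaded kernel launch.  */
          #pragma omp target teams distribute parallel for map(tofrom: a[0:N])
          for (int i = 0; i < N; i++)
            a[i] += 1;
        }
      return NULL;
    }

    int
    main (void)
    {
      pthread_t threads[NTHREADS];
      for (long t = 0; t < NTHREADS; t++)
        pthread_create (&threads[t], NULL, thread_fn, (void *) t);
      for (int t = 0; t < NTHREADS; t++)
        pthread_join (threads[t], NULL);
      /* Join done: verify every element saw every launch.  */
      for (int t = 0; t < NTHREADS; t++)
        for (int i = 0; i < N; i++)
          if (data[t][i] != LAUNCHES_PER_THREAD)
            {
              printf ("FAIL\n");
              return 1;
            }
      printf ("PASS\n");
      return 0;
    }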
Ping?

Thanks,

Julian

On Fri, 13 Nov 2020 20:54:54 +0000
Julian Brown <julian@codesourcery.com> wrote:

> [...]
On Tue, 8 Dec 2020, Julian Brown wrote:

> Ping?

This has addressed my concerns, thanks.

Alexander

> [...]
On Tue, 8 Dec 2020 20:11:38 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Tue, 8 Dec 2020, Julian Brown wrote:
>
> > Ping?
>
> This has addressed my concerns, thanks.

Jakub, Tom -- just to confirm, is this OK for trunk now?

I noticed a slight bugfix myself in the no-stacks/out-of-memory case --
i.e. for OpenACC, in nvptx_stacks_free. The attached version of the
patch includes that fix.

Thanks,

Julian
On Tue, Dec 15, 2020 at 01:39:13PM +0000, Julian Brown wrote:
> @@ -1922,7 +1997,9 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>
>    size_t stack_size = nvptx_stacks_size ();
> -  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
> +
> +  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
> +  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
>    void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
>    size_t fn_args_size = sizeof fn_args;
>    void *config[] = {
> @@ -1944,7 +2021,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
> 			 maybe_abort_msg);
>    else if (r != CUDA_SUCCESS)
>      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +
> +  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>  }

Do you need to hold the omp_stacks.lock across the entire offloading?
Doesn't that serialize all offloading kernels to the same device?
I mean, can't the lock be taken just briefly at the start, to either
acquire the cached stacks or allocate a fresh stack, and then again at
the end, to put the stack back into the cache?

Also, how will this caching interact with malloc etc. performed in
target regions? Shall we do the caching only if there is no other
concurrent offloading to the device, given that the newlib malloc will
not be able to figure out it could free this block and let the host
know it has freed it?

	Jakub
On Tue, 15 Dec 2020 14:49:40 +0100
Jakub Jelinek <jakub@redhat.com> wrote:

> Do you need to hold the omp_stacks.lock across the entire
> offloading? Doesn't that serialize all offloading kernels to the
> same device? I mean, can't the lock be taken just briefly at the
> start, to either acquire the cached stacks or allocate a fresh
> stack, and then again at the end, to put the stack back into the
> cache?

I think you're suggesting something like what Alexander mentioned -- a
pool of cached stacks blocks in case the single, locked block is
contested. Obviously at present kernel launches are serialised on the
target anyway, so it's a question of whether having the device wait
for the host to unlock the stacks block (i.e. a context switch, FSVO
context switch), or allocating a new stacks block, is quicker. I think
the numbers posted in the parent email show that memory allocation is
so slow that just waiting for the lock wins. I'm wary of adding
unnecessary complication, especially if it'll only be exercised in
already hard-to-debug cases (i.e. lots of threads)!

Just ignoring the cache if it's "in use" (and doing an allocation/free
of another stacks block, as at present) is something I'd not quite
considered. Indeed that might work, but I'm not sure if it'll be any
faster in practice.

> Also, how will this caching interact with malloc etc. performed in
> target regions? Shall we do the caching only if there is no other
> concurrent offloading to the device, given that the newlib malloc
> will not be able to figure out it could free this block and let the
> host know it has freed it?

Does target-side memory allocation call back into the plugin's
GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
memory allocation shouldn't be affected, I don't think?

Thanks,

Julian
On Tue, Dec 15, 2020 at 04:49:38PM +0000, Julian Brown wrote:
> > Do you need to hold the omp_stacks.lock across the entire
> > offloading? Doesn't that serialize all offloading kernels to the
> > same device? I mean, can't the lock be taken just briefly at the
> > start, to either acquire the cached stacks or allocate a fresh
> > stack, and then again at the end, to put the stack back into the
> > cache?
>
> I think you're suggesting something like what Alexander mentioned --
> a pool of cached stacks blocks in case the single, locked block is
> contested. Obviously at present kernel launches are serialised on
> the target anyway, so it's a question of whether having the device
> wait for the host to unlock the stacks block (i.e. a context switch,
> FSVO context switch), or allocating a new stacks block, is quicker.
> I think the numbers posted in the parent email show that memory
> allocation is so slow that just waiting for the lock wins. I'm wary
> of adding unnecessary complication, especially if it'll only be
> exercised in already hard-to-debug cases (i.e. lots of threads)!

I'm not suggesting to have multiple stacks -- on the contrary. I've
suggested doing the caching only if at most one host thread is
offloading to the device.

If one uses

#pragma omp parallel num_threads(3)
{
  #pragma omp target
  ...
}

then I don't see what would previously prevent the concurrent
offloading; yes, we take the device lock during gomp_map_vars and
again during gomp_unmap_vars, but don't hold it across the offloading
in between.

> Does target-side memory allocation call back into the plugin's
> GOMP_OFFLOAD_alloc? I'm not sure how that works. If not, target-side
> memory allocation shouldn't be affected, I don't think?

Again, I'm not suggesting that it should, but what I'm saying is that
if a target region ends but some other host tasks are doing target
regions to the same device concurrently with that, or if there are
async targets in flight, we shouldn't try to cache the stack, but
free it right away, because the other target regions might need to
malloc larger amounts of memory and fail because of the caching.

	Jakub
On Tue, 15 Dec 2020 18:00:36 +0100
Jakub Jelinek <jakub@redhat.com> wrote:

> I'm not suggesting to have multiple stacks -- on the contrary. I've
> suggested doing the caching only if at most one host thread is
> offloading to the device.
>
> If one uses
>
> #pragma omp parallel num_threads(3)
> {
>   #pragma omp target
>   ...
> }
>
> then I don't see what would previously prevent the concurrent
> offloading; yes, we take the device lock during gomp_map_vars and
> again during gomp_unmap_vars, but don't hold it across the
> offloading in between.

I still don't think I quite understand what you're getting at.

We only implement synchronous launches for OpenMP on NVPTX at present,
and those all use the default CUDA runtime driver stream. Only one
kernel executes on the hardware at once, even if launched from
different host threads. The serialisation isn't due to the device lock
being held, but to the queueing semantics of the underlying API.

> Again, I'm not suggesting that it should, but what I'm saying is
> that if a target region ends but some other host tasks are doing
> target regions to the same device concurrently with that, or if
> there are async targets in flight, we shouldn't try to cache the
> stack, but free it right away, because the other target regions
> might need to malloc larger amounts of memory and fail because of
> the caching.

I'm assuming you're not suggesting fundamentally changing APIs or
anything to determine if we're launching target regions from multiple
threads at once, but instead that we try to detect the condition
dynamically in the plugin?

So, would kernel launch look something like this? (Excuse
pseudo-code-isms!)

void
GOMP_OFFLOAD_run (...)
{
  bool used_cache;

  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
  if (ptx_dev->omp_stacks.usage_count > 0)
    {
      cuCtxSynchronize ();
      nvptx_stacks_free (ptx_dev);
      /* ...allocate fresh stack, no caching...  */
      used_cache = false;
    }
  else
    {
      /* Allocate or re-use cached stacks, and then...  */
      ptx_dev->omp_stacks.usage_count++;
      used_cache = true;
    }
  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);

  /* Launch kernel.  */

  if (used_cache)
    {
      /* Via cuStreamAddCallback:  */
      pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
      ptx_dev->omp_stacks.usage_count--;
      pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
    }
  else
    {
      pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
      /* Free uncached stack.  */
      pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
    }
}

This seems like it'd be rather fragile to me, and would offer some
benefit perhaps only if a previous cached stacks block was much larger
than the one required for some given later launch. It wouldn't allow
any additional parallelism on the target, I don't think.

Is that sort-of what you meant?

Oh, or perhaps something more like checking cuStreamQuery at the end
of the kernel launch to see if more work (...from other threads) is
outstanding on the same queue? I think that only usefully returns
CUDA_SUCCESS/CUDA_ERROR_NOT_READY, so I'm not sure if that'd help.

Thanks for clarification (& apologies for being slow!),

Julian
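[For reference, the cuStreamQuery probe mentioned above might look
roughly like this -- a sketch only, since where such a check would sit
in GOMP_OFFLOAD_run is exactly the open question. cuStreamQuery on the
NULL stream is a real, non-blocking driver-API call with the two return
values named in the message:]

    /* Sketch: probe whether previously-queued work on the default
       stream has drained.  */
    CUresult r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
    if (r == CUDA_SUCCESS)
      ;  /* Stream idle: no other launches outstanding; caching the
            stacks block would be safe.  */
    else if (r == CUDA_ERROR_NOT_READY)
      ;  /* Other work still queued, e.g. from other host threads.  */
    else
      GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));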
Hi Jakub,

Just to check, does my reply below address your concerns --
particularly with regards to the current usage of CUDA streams
serializing kernel executions from different host threads? Given that
situation, and the observed speed improvement with OpenMP offloading
to NVPTX with the patch, I'm not sure how much sense it makes to do
anything more sophisticated than this -- especially without a test
case that demonstrates a performance regression (or an exacerbated
out-of-memory condition) with the patch.

Thanks,

Julian

On Tue, 15 Dec 2020 23:16:48 +0000
Julian Brown <julian@codesourcery.com> wrote:

> [...]
On Tue, Jan 05, 2021 at 12:13:59PM +0000, Julian Brown wrote:
> Just to check, does my reply below address your concerns --
> particularly with regards to the current usage of CUDA streams
> serializing kernel executions from different host threads? Given that
> situation, and the observed speed improvement with OpenMP offloading
> to NVPTX with the patch, I'm not sure how much sense it makes to do
> anything more sophisticated than this -- especially without a test
> case that demonstrates a performance regression (or an exacerbated
> out-of-memory condition) with the patch.

I guess I can live with it for GCC 11, but would like this to be
reconsidered for GCC 12; people do run OpenMP offloading code from
multiple, often concurrent, threads and we shouldn't serialize it
unnecessarily.

	Jakub
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..e7ff5d5213e0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -137,6 +137,15 @@ init_cuda_lib (void)
 #define MIN(X,Y) ((X) < (Y) ? (X) : (Y))
 #define MAX(X,Y) ((X) > (Y) ? (X) : (Y))
 
+static const char *
+maybe_abort_message (unsigned errmsg)
+{
+  if (errmsg == CUDA_ERROR_LAUNCH_FAILED)
+    return " (perhaps abort was called)";
+  else
+    return "";
+}
+
 /* Convenience macros for the frequently used CUDA library call and
    error handling sequence as well as CUDA library calls that
    do the error checking themselves or don't do it at all.  */
@@ -147,8 +156,9 @@ init_cuda_lib (void)
       = CUDA_CALL_PREFIX FN (__VA_ARGS__);        \
     if (__r != CUDA_SUCCESS)                      \
       {                                           \
-        GOMP_PLUGIN_error (#FN " error: %s",      \
-                           cuda_error (__r));     \
+        GOMP_PLUGIN_error (#FN " error: %s%s",    \
+                           cuda_error (__r),      \
+                           maybe_abort_message (__r)); \
         return ERET;                              \
       }                                           \
   } while (0)
@@ -162,8 +172,9 @@ init_cuda_lib (void)
       = CUDA_CALL_PREFIX FN (__VA_ARGS__);        \
     if (__r != CUDA_SUCCESS)                      \
       {                                           \
-        GOMP_PLUGIN_fatal (#FN " error: %s",      \
-                           cuda_error (__r));     \
+        GOMP_PLUGIN_fatal (#FN " error: %s%s",    \
+                           cuda_error (__r),      \
+                           maybe_abort_message (__r)); \
       }                                           \
   } while (0)
@@ -307,6 +318,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+    {
+      CUdeviceptr ptr;
+      size_t size;
+      pthread_mutex_t lock;
+    } omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +533,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +557,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+    CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
     CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -1866,26 +1894,49 @@ nvptx_stacks_size ()
   return 128 * 1024;
 }
 
-/* Return contiguous storage for NUM stacks, each SIZE bytes.  */
+/* Return contiguous storage for NUM stacks, each SIZE bytes, and obtain the
+   lock for that storage.  */
 
 static void *
-nvptx_stacks_alloc (size_t size, int num)
+nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 {
-  CUdeviceptr stacks;
-  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &stacks, size * num);
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
+    return (void *) ptx_dev->omp_stacks.ptr;
+
+  /* Free the old, too-small stacks.  */
+  if (ptx_dev->omp_stacks.ptr)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s\n", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+    }
+
+  /* Make new and bigger stacks, and remember where we put them and how big
+     they are.  */
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &ptx_dev->omp_stacks.ptr,
+				  size * num);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
-  return (void *) stacks;
+
+  ptx_dev->omp_stacks.size = size * num;
+
+  return (void *) ptx_dev->omp_stacks.ptr;
 }
 
-/* Release storage previously allocated by nvptx_stacks_alloc.  */
+/* Release the lock associated with a ptx_device's OpenMP stacks block.  */
 
 static void
-nvptx_stacks_free (void *p, int num)
+nvptx_stacks_release (CUstream stream, CUresult res, void *ptr)
 {
-  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, (CUdeviceptr) p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  if (res != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+  struct ptx_device *ptx_dev = (struct ptx_device *) ptr;
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
 }
 
 void
@@ -1898,7 +1949,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   const char *fn_name = launch->fn;
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
-  const char *maybe_abort_msg = "(perhaps abort was called)";
   int teams = 0, threads = 0;
 
   if (!args)
@@ -1922,7 +1972,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
-  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
   void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
   size_t fn_args_size = sizeof fn_args;
   void *config[] = {
@@ -1938,13 +1988,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
-  if (r == CUDA_ERROR_LAUNCH_FAILED)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
-		       maybe_abort_msg);
-  else if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-  nvptx_stacks_free (stacks, teams * threads);
+  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
+		    (void *) ptx_dev, 0);
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run.  */