| Message ID | 87zgan6eug.fsf@euler.schwinge.homeip.net |
| --- | --- |
| State | New |
| Series | nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case (was: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes) |
Hi Thomas,

On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
> (..., which deadlocks); that's generally problematic: per
> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".

I remember running into this myself when first creating this async support
(IIRC in my case it was cuMemFree()-ing something), yet you've found another
mistake here! :)

> Given that eventually we must reach a host/device synchronization point
> (latest when the device is shut down at program termination), and the
> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
> attached. OK to push?

I think this patch is fine. Actual approval powers are yours or Tom's :)

> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
> error will be caught and reported at the next host/device synchronization
> point? But I've not verified that.)

Actually, the CUDA driver API docs are a bit vague on what exactly this
CUresult arg to the callback means. The 'res != CUDA_SUCCESS' handling here
was basically just generic handling. I am not really sure what the truly
right thing to do here is (is the error still retained by CUDA after the
callback completes?).

Chung-Lin
Hi!

On 2023-01-13T21:17:43+0800, Chung-Lin Tang <chunglin.tang@siemens.com> wrote:
> On 2023/1/12 9:51 PM, Thomas Schwinge wrote:
>> In my case, 'cuda_callback_wrapper' (expectedly) gets invoked with
>> 'res != CUDA_SUCCESS' ("an illegal memory access was encountered").
>> When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
>> (..., which deadlocks); that's generally problematic: per
>> https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
>> "'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".
>
> I remember running into this myself when first creating this async support
> (IIRC in my case it was cuMemFree()-ing something), yet you've found another
> mistake here! :)

;-)

>> Given that eventually we must reach a host/device synchronization point
>> (latest when the device is shut down at program termination), and the
>> non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
>> replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error' as per the
>> "nvptx: Avoid deadlock in 'cuStreamAddCallback' callback, error case"
>> attached. OK to push?
>
> I think this patch is fine. Actual approval powers are yours or Tom's :)

ACK. I'll let it sit for some more time before 'git push'.

>> (Might we even skip 'GOMP_PLUGIN_error' here, understanding that the
>> error will be caught and reported at the next host/device synchronization
>> point? But I've not verified that.)
>
> Actually, the CUDA driver API docs are a bit vague on what exactly this
> CUresult arg to the callback actually means. The 'res != CUDA_SUCCESS'
> handling here was basically just generic handling.

I suppose this really is just for its own use: for example, to skip certain
things in the presence of a pre-existing error?

> I am not really sure what the truly right thing to do here is (is the error
> still retained by CUDA after the callback completes?)
Indeed the latter is what I do observe:

GOMP_OFFLOAD_openacc_async_exec: prepare mappings
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32
nvptx_exec: kernel main$_omp_fn$0: finished
libgomp: cuMemcpyDtoHAsync_v2 error: an illegal memory access was encountered
libgomp: libgomp: Copying of dev object [0x7f9a45000000..0x7f9a45000028) to host object [0x1d89350..0x1d89378) failed
cuda_callback_wrapper error: an illegal memory access was encountered
libgomp: cuStreamDestroy error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed

Here, after the 'async' OpenACC 'parallel', a 'copyout' gets enqueued, thus
'cuMemcpyDtoHAsync_v2', which is where we first get the device-side fault
reported (all as expected).

Then -- CUDA-internally multi-threaded, I suppose (thus the mangled
printing) -- we print the 'Copying [...] failed' error plus get
'cuda_callback_wrapper' invoked. This receives the previous 'CUresult' as
seen, and then the error is still visible at device shut-down, as shown by
the following reports. (This makes sense, as the 'CUcontext' does not
magically recover.)

Also, per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>,
"In the event of a device error, all subsequently executed callbacks will
receive an appropriate 'CUresult'".

But again: I'm perfectly fine with the repeated error reporting.

Grüße
 Thomas

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer:
Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München;
Registergericht München, HRB 106955
From b7ddcc0807967750e3c884326ed4c53c05cde81f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 12 Jan 2023 14:39:46 +0100
Subject: [PATCH] nvptx: Avoid deadlock in 'cuStreamAddCallback' callback,
 error case

When we invoke 'GOMP_PLUGIN_fatal', this attempts to shut down the device
(..., which may deadlock); that's generally problematic: per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483>
"'cuStreamAddCallback' [...] Callbacks must not make any CUDA API calls".

Given that eventually we must reach a host/device synchronization point
(latest when the device is shut down at program termination), and the
non-'CUDA_SUCCESS' will be upheld until then, it does seem safe to
replace this 'GOMP_PLUGIN_fatal' with 'GOMP_PLUGIN_error'.

	libgomp/
	* plugin/plugin-nvptx.c (cuda_callback_wrapper): Invoke
	'GOMP_PLUGIN_error' instead of 'GOMP_PLUGIN_fatal'.
---
 libgomp/plugin/plugin-nvptx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 395639537e83..cdb3d435bdc8 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1927,7 +1927,7 @@ static void
 cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
 {
   if (res != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+    GOMP_PLUGIN_error ("%s error: %s", __FUNCTION__, cuda_error (res));
   struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
   cb->fn (cb->ptr);
   free (ptr);
-- 
2.39.0