Message ID | 87sf12dxon.fsf@euler.schwinge.ddns.net |
---|---|
State | New |
Headers | show |
Series | nvptx: 'cuDeviceGetCount' failure is fatal (was: [Patch] OpenMP: Move omp requires checks to libgomp) | expand |
Hi Thomas, Thomas Schwinge wrote: >> /* Return the number of GCN devices on the system. */ >> int >> -GOMP_OFFLOAD_get_num_devices (void) >> +GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask) >> { >> if (!init_hsa_context ()) >> return 0; >> + /* Return -1 if no omp_requires_mask cannot be fulfilled but >> + devices were present. */ >> + if (hsa_context.agent_count > 0 && omp_requires_mask != 0) >> + return -1; >> return hsa_context.agent_count; >> } ... > OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"? I think the real question is: what does a 'cuDeviceGetCount' fail mean? Does it mean a serious error – or could it just be a permissions issue such that the user has no device access but otherwise is fine? Because if it is, e.g., a permission problem – just returning '0' (no devices) would seem to be the proper solution. But if it is expected to be always something serious, well, then a fatal error makes more sense. The possible exit codes are: CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED, CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE which does not really help. My impression is that 0 is usually returned if something goes wrong (e.g. with permissions) such that an error is a real exception. But all three choices seem to make about equally sense: either host fallback (with 0 or -1) or a fatal error. Tobias
Hi Tobias! On 2024-03-07T15:28:21+0100, Tobias Burnus <tburnus@baylibre.com> wrote: > Thomas Schwinge wrote: >> OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"? > > I think the real question is: what does a 'cuDeviceGetCount' fail mean? Internally to the CUDA stack: the error codes that you've cited below. Per the state we're in when calling 'cuDeviceGetCount', we only expect 'CUDA_SUCCESS'. Therefore, in our actual use: anything else means a fatal condition that we don't attempt to recover from, like for most of all other device access failures. > Does it mean a serious error – or could it just be a permissions issue > such that the user has no device access but otherwise is fine? As you can see, we've done a 'cuInit' right before, so in case there was any permission issue (or similar), that's already settled (in whichever way) by the time we do the 'cuDeviceGetCount'. > Because if it is, e.g., a permission problem – just returning '0' (no > devices) would seem to be the proper solution. > > But if it is expected to be always something serious, well, then a fatal > error makes more sense. ACK; pushed in commit 37078f241a22c45db6380c5e9a79b4d08054bb3d. Grüße Thomas > The possible exit codes are: > > CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED, > CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE > > which does not really help. > > My impression is that 0 is usually returned if something goes wrong > (e.g. with permissions) such that an error is a real exception. But all > three choices seem to make about equally sense: either host fallback > (with 0 or -1) or a fatal error. > > Tobias
From 8090da93cb00e4aa47a8b21b6548d739b2cebc49 Mon Sep 17 00:00:00 2001 From: Thomas Schwinge <tschwinge@baylibre.com> Date: Thu, 7 Mar 2024 13:18:23 +0100 Subject: [PATCH] nvptx: 'cuDeviceGetCount' failure is fatal Per commit 683f11843974f0bdf42f79cdcbb0c2b43c7b81b0 "OpenMP: Move omp requires checks to libgomp", we're now using 'return -1' from 'GOMP_OFFLOAD_get_num_devices' for 'omp_requires_mask' purposes. This missed that via 'nvptx_get_num_devices', we could also 'return -1' for 'cuDeviceGetCount' failure. Before, this meant (in 'gomp_target_init') to silently ignore the plugin/device -- which also has been doubtful behavior. Let's instead turn 'cuDeviceGetCount' failure into a fatal error, similar to other errors during device initialization. libgomp/ * plugin/plugin-nvptx.c (nvptx_get_num_devices): 'cuDeviceGetCount' failure is fatal. --- libgomp/plugin/plugin-nvptx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index ffb1db67d20..81b4a7f499a 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -630,7 +630,7 @@ nvptx_get_num_devices (void) } } - CUDA_CALL_ERET (-1, cuDeviceGetCount, &n); + CUDA_CALL_ASSERT (cuDeviceGetCount, &n); return n; } -- 2.34.1