Message ID: 87oacheqlz.fsf@hertz.schwinge.homeip.net
State: New
On Tue, 19 Jan 2016, Thomas Schwinge wrote:
> Hi!
>
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
>
>     [...]
>     info : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
>     info : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
>     [...]
>     nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
>
>     libgomp: cuLaunchKernel error: too many resources requested for launch
>
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

Yes, today most CUDA GPUs allow 64K registers per block, some allow 32K, so
87*32*32 definitely overflows that limit.  A reference is available in the
CUDA C Programming Guide, appendix G, table 13:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(Well, if you want, I can point out that 1) GCC never emits launch bounds,
so the PTX JIT has to guess limits -- that's something I'd like to play
with in the future, time permitting; 2) OpenACC register copying at forks
increases (pseudo-)register pressure; 3) I think if you inspect the PTX
code you'll see it uses way more than 87 registers.)

As for the proposed patch, does the OpenACC spec leave the implementation
the freedom to spawn a different number of workers than requested?
(Honest question -- I didn't look at the spec that closely.)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":

[snip]

> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying the
> active device, enum CUdevice_attribute "Device properties":
>
>     [...]
>     CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
>         Maximum number of 32-bit registers available per block
>     [...]
>
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":

[snip]

> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).

I have implemented that for OpenMP offloading, but also, since CUDA 6.0
there's the cuOcc* (occupancy query) interface, which allows one to simply
ask the driver about the per-function launch limit.

Thanks.
Alexander
On 01/19/16 06:49, Thomas Schwinge wrote:
> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

My understanding is that the JIT compiler does register allocation.

>       int axis = get_oacc_ifn_dim_arg (call);
> +     if (axis == GOMP_DIM_WORKER)
> +       {
> +         /* libgomp's nvptx plugin might potentially modify
> +            dims[GOMP_DIM_WORKER].  */
> +         return NULL_TREE;
> +       }

This is almost certainly wrong.  You're preventing constant folding in the
compiler.

nathan
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > ... to determine an optimal number of threads per block given the number
> > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > would do that already?).
>
> I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> cuOcc* (occupancy query) interface that allows to simply ask the driver about
> the per-function launch limit.

Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
is indeed sufficient for limiting threads per block, which is trivially
translatable into workers per gang in OpenACC.  IMO it's also a cleaner
approach in this case, compared to iterative backoff (if, again, the
implementation is free to do that).  When mentioning cuOcc* I was thinking
about finding an optimal number of blocks per device, which is a different
story.

Alexander
Hi!

On Tue, 19 Jan 2016 08:47:02 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> On 01/19/16 06:49, Thomas Schwinge wrote:
> >       int axis = get_oacc_ifn_dim_arg (call);
> > +     if (axis == GOMP_DIM_WORKER)
> > +       {
> > +         /* libgomp's nvptx plugin might potentially modify
> > +            dims[GOMP_DIM_WORKER].  */
> > +         return NULL_TREE;
> > +       }
>
> this is almost certainly wrong.  You're preventing constant folding in the
> compiler.

Yes, because if libgomp can modify dims[GOMP_DIM_WORKER], in the compiler
we can no longer assume it to be constant?  (Folding it did result in a
run-time test verification failure.)  Of course, my hammer might be too big
a one (which is why this is an RFC).

Grüße
 Thomas
Hi!

On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > ... to determine an optimal number of threads per block given the number
> > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > would do that already?).
> >
> > I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> > cuOcc* (occupancy query) interface that allows to simply ask the driver about
> > the per-function launch limit.

You mean you have already implemented something along the lines I proposed?

> Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
> indeed sufficient for limiting threads per block, which is trivially
> translatable into workers per gang in OpenACC.

That's good to know, thanks!

> IMO it's also a cleaner
> approach in this case, compared to iterative backoff (if, again, the
> implementation is free to do that).

It is not explicitly spelled out in OpenACC 2.0a, but it got clarified in
OpenACC 2.5.  See "2.5.7. num_workers clause": "[...] The implementation may
use a different value than specified based on limitations imposed by the
target architecture".

> When mentioning cuOcc* I was thinking about finding an optimal number of
> blocks per device, which is a different story.

:-)

Grüße
 Thomas
On Tue, 19 Jan 2016, Thomas Schwinge wrote:
> Hi!
>
> On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amonakov@ispras.ru> wrote:
> > On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > > ... to determine an optimal number of threads per block given the number
> > > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > > would do that already?).
> > >
> > > I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> > > cuOcc* (occupancy query) interface that allows to simply ask the driver about
> > > the per-function launch limit.
>
> You mean you already have implemented something along the lines I
> proposed?

Yes, I was implementing OpenMP teams, and it made sense to add warps-per-block
limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if the
default or requested number of threads per team is too high).  I intend to
post that patch as part of a larger series shortly (the patch itself is
simple enough, although a small tweak will be needed to make it apply to
OpenACC too).

Alexander
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > You mean you already have implemented something along the lines I
> > proposed?
>
> Yes, I was implementing OpenMP teams, and it made sense to add warps per block
> limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if
> default or requested number of threads per team is too high).  I intend to
> post that patch as part of a larger series shortly (but the patch itself is
> simple enough, although a small tweak will be needed to make it apply to
> OpenACC too).

Here's the patch I was talking about:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=04e68c22081c36caf5da9d9f4ca5e895e1088c78;hp=735c8a7d88a7e14cb707f22286678982174175a6

Alexander
diff --git gcc/gimple-fold.c gcc/gimple-fold.c
index a0e7b7e..e75c58e 100644
--- gcc/gimple-fold.c
+++ gcc/gimple-fold.c
@@ -2935,6 +2935,13 @@ fold_internal_goacc_dim (const gimple *call)
     return NULL_TREE;
 
   int axis = get_oacc_ifn_dim_arg (call);
+  if (axis == GOMP_DIM_WORKER)
+    {
+      /* libgomp's nvptx plugin might potentially modify
+	 dims[GOMP_DIM_WORKER].  */
+      return NULL_TREE;
+    }
+
   int size = get_oacc_fn_dim_size (current_function_decl, axis);
   bool is_pos = gimple_call_internal_fn (call) == IFN_GOACC_DIM_POS;
   tree result = NULL_TREE;
diff --git gcc/tree-vrp.c gcc/tree-vrp.c
index e6c11e0..a0a78d2 100644
--- gcc/tree-vrp.c
+++ gcc/tree-vrp.c
@@ -3980,6 +3980,7 @@ extract_range_basic (value_range *vr, gimple *stmt)
 	  break;
 	case CFN_GOACC_DIM_SIZE:
 	case CFN_GOACC_DIM_POS:
+	  //TODO: is this kosher regarding libgomp's nvptx plugin potentially modifying dims[GOMP_DIM_WORKER]?
 	  /* Optimizing these two internal functions helps the loop
 	     optimizer eliminate outer comparisons.  Size is [1,N]
 	     and pos is [0,N-1].  */
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index eea74d4..54fd5cb 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -974,24 +974,36 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *));
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuda_error (r));
+  kargs[0] = &dp;
 
-  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
-		     " gangs=%u, workers=%u, vectors=%u\n",
-		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
-		     dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
-
+ launch:
   // OpenACC	CUDA
   //
   // num_gangs	nctaid.x
   // num_workers	ntid.y
   // vector length	ntid.x
-
-  kargs[0] = &dp;
+  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
+		     " gangs=%u, workers=%u, vectors=%u\n",
+		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
+		     dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
   r = cuLaunchKernel (function,
 		      dims[GOMP_DIM_GANG], 1, 1,
 		      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		      0, dev_str->stream, kargs, 0);
-  if (r != CUDA_SUCCESS)
+  if (r == CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES)
+    {
+      /* Don't give up just yet; possibly too many threads for the kernel's
+	 register count.  */
+      if (dims[GOMP_DIM_WORKER] > 1)
+	{
+	  dims[GOMP_DIM_WORKER] /= 2;
+	  GOMP_PLUGIN_debug (0, "  cuLaunchKernel: "
+			     "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying "
+			     "with reduced number of workers\n");
+	  goto launch;
+	}
+    }
+  if (r != CUDA_SUCCESS) //CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
 #ifndef DISABLE_ASYNC