Message ID | 1445366076-16082-12-git-send-email-amonakov@ispras.ru |
---|---|
State | New |
Headers | show |
On 10/20/2015 08:34 PM, Alexander Monakov wrote: > NVPTX does not support alloca or variable-length stack allocations, thus > heap allocation needs to be used instead. I've opted to make this a generic > change instead of guarding it with an #ifdef: libgomp usually leaves thread > stack size up to libc, so avoiding unbounded stack allocation makes sense. > > * task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap > allocation instead of a variable-size on-stack allocation. > + char buf_fixed[2048], *buf = buf_fixed; This might also not be the best of ideas on a GPU - the stack size isn't all that unlimited, what with there being lots of threads. If I do size_t stack, heap; cuCtxGetLimit (&stack, CU_LIMIT_STACK_SIZE); in the nvptx-run program we've used for testing, it shows a default stack size of just 1kB. Bernd
On Tue, 20 Oct 2015, Bernd Schmidt wrote: > On 10/20/2015 08:34 PM, Alexander Monakov wrote: > > NVPTX does not support alloca or variable-length stack allocations, thus > > heap allocation needs to be used instead. I've opted to make this a generic > > change instead of guarding it with an #ifdef: libgomp usually leaves thread > > stack size up to libc, so avoiding unbounded stack allocation makes sense. > > > > * task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap > > allocation instead of a variable-size on-stack allocation. > > > + char buf_fixed[2048], *buf = buf_fixed; > > This might also not be the best of ideas on a GPU - the stack size isn't all > that unlimited, what with there being lots of threads. If I do > > size_t stack, heap; > cuCtxGetLimit (&stack, CU_LIMIT_STACK_SIZE); > > in the nvptx-run program we've used for testing, it shows a default stack size > of just 1kB. Thanks, NVPTX will need a low buf_fixed size, perhaps 64 bytes or so. What about the generic case, should it use a more generous threshold, or revert to existing unbounded alloca? Any ideas how big is the required allocation size is in practice? Thanks. Alexander
On 10/20/2015 11:36 PM, Alexander Monakov wrote: > Thanks, NVPTX will need a low buf_fixed size, perhaps 64 bytes or so. > What about the generic case, should it use a more generous threshold, > or revert to existing unbounded alloca? > > Any ideas how big is the required allocation size is in practice? I'll defer to Jakub for questions and patches that are more strongly libgomp-related than ptx-related. Bernd
On Tue, Oct 20, 2015 at 09:34:33PM +0300, Alexander Monakov wrote: > NVPTX does not support alloca or variable-length stack allocations, thus > heap allocation needs to be used instead. I've opted to make this a generic > change instead of guarding it with an #ifdef: libgomp usually leaves thread > stack size up to libc, so avoiding unbounded stack allocation makes sense. > > * task.c (GOMP_task): Use a fixed-size on-stack buffer or a heap > allocation instead of a variable-size on-stack allocation. I don't like this unconditionally. This really isn't unbounded, the buffer just contains the privatized variables. If one uses int c[124]; void foo (void) { int a, b[10]; #pragma omp parallel firstprivate (a, b, c) { use (a, b, c); } } then the private copies of the variables are allocated on the stack too, a, b already in addition to the original non-privatized vars a and b, c, which has been above a global var, is automatic just in the private copy. Now, for #pragma omp task firstprivate (a, b, c) if there are copy constructors involved, the copy ctors for the firstprivate vars need to be run before GOMP_task returns, and therefore we let those variables live in the heap rather than on the stack; but if we know the task needs to execute immediately, with the alloca we do pretty much the same thing as parallel does, all the privatized variables are allocated on the stack (just all of them together using alloca instead of individually by the compiler). I'm fine with temporarily having some #ifdef HAVE_BROKEN_ALLOCA or similar, but as nvptx I think doesn't support setjmp/longjmp nor computed goto, I think just supporting alloca by using malloc instead and freeing at the end of function if any allocations happened in the function is the right thing. > --- > libgomp/task.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/libgomp/task.c b/libgomp/task.c > index 74920d5..ffb7ed2 100644 > --- a/libgomp/task.c > +++ b/libgomp/task.c > @@ -162,11 +162,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *), > thr->task = &task; > if (__builtin_expect (cpyfn != NULL, 0)) > { > - char buf[arg_size + arg_align - 1]; > + long buf_size = arg_size + arg_align - 1; > + char buf_fixed[2048], *buf = buf_fixed; > + if (sizeof(buf_fixed) < buf_size) > + buf = gomp_malloc (buf_size); > char *arg = (char *) (((uintptr_t) buf + arg_align - 1) > & ~(uintptr_t) (arg_align - 1)); > cpyfn (arg, data); > fn (arg); > + if (buf != buf_fixed) > + free (buf); > } > else > fn (data); Jakub
diff --git a/libgomp/task.c b/libgomp/task.c index 74920d5..ffb7ed2 100644 --- a/libgomp/task.c +++ b/libgomp/task.c @@ -162,11 +162,16 @@ GOMP_task (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *), thr->task = &task; if (__builtin_expect (cpyfn != NULL, 0)) { - char buf[arg_size + arg_align - 1]; + long buf_size = arg_size + arg_align - 1; + char buf_fixed[2048], *buf = buf_fixed; + if (sizeof(buf_fixed) < buf_size) + buf = gomp_malloc (buf_size); char *arg = (char *) (((uintptr_t) buf + arg_align - 1) & ~(uintptr_t) (arg_align - 1)); cpyfn (arg, data); fn (arg); + if (buf != buf_fixed) + free (buf); } else fn (data);