Message ID | alpine.LNX.2.20.1511121609140.3050@monopod.intra.ispras.ru |
---|---|
State | New |
Headers | show |
> I'm proposing the following patch as a step towards resolving the issue with > inaccessibility of stack storage (.local memory) in PTX to other threads than > the one using that stack. The idea is to have preallocated stacks, and have > __nvptx_stacks[] array in shared memory hold current stack pointers. Each > thread is maintaining __nvptx_stacks[tid.y] as its stack pointer, thus for > OpenMP the intent is to preallocate on a per-warp basis (not per-thread). > For OpenMP SIMD regions we'll have to ensure that conflicting accesses are not > introduced. This is of course really ugly; I'd propose we keep it on an nvptx-OpenMP specific branch for now until we know that this is really going somewhere. > I've run it through make -k check-c regtesting. These are new fails, all > mysterious: These would have to be investigated first. > + sz = (sz + keep_align - 1) & ~(keep_align - 1); Use the ROUND_UP macro. > + fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n", > + bits == 64 ? ".wide" : "", bits); Use a shift. > + > + if (need_softstack_decl) > + { > + fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;"); > + } Lose excess braces. > +.global .u64 %__softstack[16384]; Maybe declarea as .u8 so you don't have two different constants for the stack size? > + .reg .u64 %stackptr; > + mov.u64 %stackptr, %__softstack; > + cvta.global.u64 %stackptr, %stackptr; > + add.u64 %stackptr, %stackptr, 131072; > + st.shared.u64 [__nvptx_stacks], %stackptr; > + I'm guessing you have other missing pieces for setting this up for multiple threads. Bernd
On Thu, 12 Nov 2015, Bernd Schmidt wrote: > > I've run it through make -k check-c regtesting. These are new fails, all > > mysterious: > > These would have to be investigated first. Any specific suggestions? The PTX code emitted from GCC differs only in prologue/epilogue, so whatever's broken... I think is unlikely due to this change. I can give it another try after upgrading CUDA driver and cuda-gdb from 7.0 to latest. > > + sz = (sz + keep_align - 1) & ~(keep_align - 1); > > Use the ROUND_UP macro. OK, thanks. > > + fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n", > > + bits == 64 ? ".wide" : "", bits); > > Use a shift. I think mul is acceptable here: PTX JIT is handling it properly, according to what I saw while investigating in cuda-gdb. If I used a shift, I'd also have to introduce another instruction for a widening integer conversion in the 64-bit case. Do you insist? > > + > > + if (need_softstack_decl) > > + { > > + fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;"); > > + } > > Lose excess braces. OK. > > +.global .u64 %__softstack[16384]; > > Maybe declarea as .u8 so you don't have two different constants for the stack > size? OK, with ".align 8" to ensure 64-bit alignment. > > + .reg .u64 %stackptr; > > + mov.u64 %stackptr, %__softstack; > > + cvta.global.u64 %stackptr, %stackptr; > > + add.u64 %stackptr, %stackptr, 131072; > > + st.shared.u64 [__nvptx_stacks], %stackptr; > > + > > I'm guessing you have other missing pieces for setting this up for multiple > threads. This is crt0.s, which is linked in only for single-threaded testing with -mmainkernel; for OpenMP, the intention is to handle it in the file that implements libgomp_nvptx_main. Thanks. Alexander
On 11/12/2015 03:59 PM, Alexander Monakov wrote: > On Thu, 12 Nov 2015, Bernd Schmidt wrote: >>> I've run it through make -k check-c regtesting. These are new fails, all >>> mysterious: >> >> These would have to be investigated first. > > Any specific suggestions? The PTX code emitted from GCC differs only in > prologue/epilogue, so whatever's broken... I think is unlikely due to this > change. I can give it another try after upgrading CUDA driver and cuda-gdb > from 7.0 to latest. Yeah, load it into cuda-gdb, that may help show what's happening. >>> + fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n", >>> + bits == 64 ? ".wide" : "", bits); >> >> Use a shift. > > I think mul is acceptable here: PTX JIT is handling it properly, according to > what I saw while investigating in cuda-gdb. If I used a shift, I'd also have > to introduce another instruction for a widening integer conversion in the > 64-bit case. Do you insist? Nah, it's fine. > > This is crt0.s, which is linked in only for single-threaded testing with > -mmainkernel; for OpenMP, the intention is to handle it in the file that > implements libgomp_nvptx_main. Yeah, that's what I meant. It might be nice to see that too if it already exists. Bernd
On Thu, Nov 12, 2015 at 04:58:21PM +0300, Alexander Monakov wrote: > I'm proposing the following patch as a step towards resolving the issue with > inaccessibility of stack storage (.local memory) in PTX to other threads than > the one using that stack. The idea is to have preallocated stacks, and have > __nvptx_stacks[] array in shared memory hold current stack pointers. Each > thread is maintaining __nvptx_stacks[tid.y] as its stack pointer, thus for > OpenMP the intent is to preallocate on a per-warp basis (not per-thread). > For OpenMP SIMD regions we'll have to ensure that conflicting accesses are not > introduced. > > I've exposed a new command-line option -msoft-stack to ease testing, but for > OpenMP we'll have to automatically flip it based on function attributes. > Right now it's not easy because OpenMP and OpenACC both use "omp declare > target". Jakub, I seem to recall a discussion about OpenACC changing to use a > separate attribute, but I cannot find it now. Any advice here? I believe OpenACC has acc routine {gang,worker,seq} that would roughly match whether certain OpenMP declare target function (or ompfn region) is/can be called within the target/teams/distribute context, or parallel context, or simd context. For OpenMP we have no such pragmas, so we need some analysis to help the PTX (and, as Martin said on IRC, HSA apparently too) and add attributes accordingly. For the .ompfn* outlined region it is easy, there we know from which construct it is, for other functions bet we want to do some IPA analysis for this, start with the .ompfn* functions marked and walk the cgraph and for declare target functions not callable from outside try to determine if they are only called from parallel contexts, or not. Does your patch affect all the stack allocations within certain function (i.e. no way to select on a per-variable bases what stack to allocate it to)? Without any detailed analysis at least e.g. spilled (non-addressable) vars could at least go to the local stack. But PTX doesn't have any spills, right? Not sure about HSA. If it is a per-function thing only, then it isn't worth to do more detailed analysis at the ompexp time. BTW, surely it will be an advantage if PTX can support alloca through this, it could e.g. turn on -msoft-stack for all functions that use alloca/VLAs automatically. Jakub
On Mon, 30 Nov 2015, Jakub Jelinek wrote: > Does your patch affect all the stack allocations within certain function > (i.e. no way to select on a per-variable bases what stack to allocate it > to)? Without any detailed analysis at least e.g. spilled (non-addressable) > vars could at least go to the local stack. But PTX doesn't have any spills, > right? Not sure about HSA. If it is a per-function thing only, then it > isn't worth to do more detailed analysis at the ompexp time. Yes, there's no register allocation and thus no spills on PTX. > BTW, surely it will be an advantage if PTX can support alloca through this, > it could e.g. turn on -msoft-stack for all functions that use alloca/VLAs > automatically. Yes, I'm going to support alloca on soft stacks, but, -msoft-stack has a prerequisite of soft stacks being initially set up. Therefore I'm treating it as an ABI variant (together with another option to handle atomics and "syscalls" outside of simd regions), and building a separate multilib for that. So I see it the other way around: it's not safe for the compiler to always use soft-stacks for alloca (because OpenACC wouldn't set up soft stacks), but if soft stacks are enabled, alloca can use them. In the multilib variant that I'm introducing, all addressable vars go to soft stacks, and classic .local stacks are used rarely, e.g. for stdarg passing, and implicitely for calls/returns (and after JIT, they'll service register spills too). Alexander
On Mon, Nov 30, 2015 at 02:14:41PM +0300, Alexander Monakov wrote: > On Mon, 30 Nov 2015, Jakub Jelinek wrote: > > Does your patch affect all the stack allocations within certain function > > (i.e. no way to select on a per-variable bases what stack to allocate it > > to)? Without any detailed analysis at least e.g. spilled (non-addressable) > > vars could at least go to the local stack. But PTX doesn't have any spills, > > right? Not sure about HSA. If it is a per-function thing only, then it > > isn't worth to do more detailed analysis at the ompexp time. > > Yes, there's no register allocation and thus no spills on PTX. > > > BTW, surely it will be an advantage if PTX can support alloca through this, > > it could e.g. turn on -msoft-stack for all functions that use alloca/VLAs > > automatically. > > Yes, I'm going to support alloca on soft stacks, but, -msoft-stack has a > prerequisite of soft stacks being initially set up. Therefore I'm treating it > as an ABI variant (together with another option to handle atomics and > "syscalls" outside of simd regions), and building a separate multilib for > that. So I see it the other way around: it's not safe for the compiler to > always use soft-stacks for alloca (because OpenACC wouldn't set up soft > stacks), but if soft stacks are enabled, alloca can use them. > > In the multilib variant that I'm introducing, all addressable vars go to soft > stacks, and classic .local stacks are used rarely, e.g. for stdarg passing, > and implicitely for calls/returns (and after JIT, they'll service register > spills too). Does it really have to be a full multilib? I mean, the only precondition is that something sets up the var, right? Would the OpenACC folks be willing to set it up too? For stuff like libc, functions that are ECF_LEAF builtins IMHO really don't care whether they are built as -msoft-stack or not, they shouldn't be passing addresses of local vars to code that could use OpenMP. The only question is if say qsort or other functions that call user callbacks could be passing addresses of local vars to those callbacks, or whether they only pass addresses passed from callers, or addresses of heap objects. Jakub
On Mon, 30 Nov 2015, Jakub Jelinek wrote: > Does it really have to be a full multilib? I mean, the only precondition is > that something sets up the var, right? Would the OpenACC folks be willing > to set it up too? For stuff like libc, functions that are ECF_LEAF builtins > IMHO really don't care whether they are built as -msoft-stack or not, they > shouldn't be passing addresses of local vars to code that could use OpenMP. > The only question is if say qsort or other functions that call user callbacks > could be passing addresses of local vars to those callbacks, or whether they > only pass addresses passed from callers, or addresses of heap objects. Well, in that full multilib there's also a second option enabled, to handle atomics/syscalls outside of simd regions, where the cost is additional code in the prologue, and one "shuffle" instruction after each atomic/syscall. It doesn't have to be a multilib, but doing it as a multilib is a safe choice w.r.t OpenACC work. Alexander
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 0204ad3..df915b9 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -114,6 +114,9 @@ static unsigned worker_red_align; #define worker_red_name "__worker_red" static GTY(()) rtx worker_red_sym; +/* True if any function references __nvptx_stacks. */ +static bool need_softstack_decl; + /* Allocate a new, cleared machine_function structure. */ static struct machine_function * @@ -689,15 +692,46 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl) /* Declare a local variable for the frame. */ sz = get_frame_size (); - if (sz > 0 || cfun->machine->has_call_with_sc) + if (sz == 0 && cfun->machine->has_call_with_sc) + sz = 1; + if (sz > 0) { int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT; - fprintf (file, "\t.reg.u%d %%frame;\n" - "\t.local.align %d .b8 %%farray[" HOST_WIDE_INT_PRINT_DEC"];\n", - BITS_PER_WORD, alignment, sz == 0 ? 1 : sz); - fprintf (file, "\tcvta.local.u%d %%frame, %%farray;\n", - BITS_PER_WORD); + fprintf (file, "\t.reg.u%d %%frame;\n", BITS_PER_WORD); + if (TARGET_SOFT_STACK) + { + /* Maintain 64-bit stack alignment. */ + int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT; + sz = (sz + keep_align - 1) & ~(keep_align - 1); + int bits = BITS_PER_WORD; + fprintf (file, "\t.reg.u32 %%fstmp0;\n"); + fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits); + fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits); + fprintf (file, "\tmov.u32 %%fstmp0, %%tid.y;\n"); + fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n", + bits == 64 ? ".wide" : "", bits); + fprintf (file, "\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits); + /* fstmp2 = &__nvptx_stacks[tid.y]; */ + fprintf (file, "\tadd.u%d %%fstmp2, %%fstmp2, %%fstmp1;\n", bits); + fprintf (file, "\tld.shared.u%d %%fstmp1, [%%fstmp2];\n", bits); + fprintf (file, "\tsub.u%d %%frame, %%fstmp1, " + HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz); + if (alignment > keep_align) + fprintf (file, "\tand.b%d %%frame, %%frame, %d;\n", + bits, -alignment); + if (!crtl->is_leaf) + fprintf (file, "\tst.shared.u%d [%%fstmp2], %%frame;\n", bits); + need_softstack_decl = true; + } + else + { + fprintf (file, "\t.local.align %d " + ".b8 %%farray[" HOST_WIDE_INT_PRINT_DEC"];\n", + alignment, sz); + fprintf (file, "\tcvta.local.u%d %%frame, %%farray;\n", + BITS_PER_WORD); + } } if (cfun->machine->has_call_with_varargs) @@ -734,6 +768,13 @@ nvptx_output_return (void) { machine_mode mode = (machine_mode)cfun->machine->ret_reg_mode; + if (TARGET_SOFT_STACK + && !crtl->is_leaf + && (get_frame_size () > 0 || cfun->machine->has_call_with_sc)) + { + int bits = BITS_PER_WORD; + fprintf (asm_out_file, "\tst.shared.u%d [%%fstmp2], %%fstmp1;\n", bits); + } if (mode != VOIDmode) { mode = arg_promotion (mode); @@ -3278,6 +3319,11 @@ nvptx_file_end (void) worker_red_align, worker_red_name, worker_red_size); } + + if (need_softstack_decl) + { + fprintf (asm_out_file, ".extern .shared .u64 __nvptx_stacks[];\n;"); + } } /* Expander for the shuffle builtins. */ diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt index 8017046..7ab09b9 100644 --- a/gcc/config/nvptx/nvptx.opt +++ b/gcc/config/nvptx/nvptx.opt @@ -28,3 +28,7 @@ Generate code for a 64-bit ABI. mmainkernel Target Report RejectNegative Link in code for a __main kernel. + +msoft-stack +Target Report Mask(SOFT_STACK) +Use custom stacks instead of local memory for automatic storage. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 587e30e..6e45fb6 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -18935,6 +18935,13 @@ Generate code for 32-bit or 64-bit ABI. Link in code for a __main kernel. This is for stand-alone instead of offloading execution. +@item -msoft-stack +@opindex msoft-stack +Do not use @code{.local} memory for automatic storage. Instead, use pointer +in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y} +as the stack pointer. This is for placing automatic variables into storage +that can be accessed from other threads, or modified with atomic instructions. + @end table @node PDP-11 Options diff --git a/libgcc/config/nvptx/crt0.s b/libgcc/config/nvptx/crt0.s index 38327ed..7a42e87 100644 --- a/libgcc/config/nvptx/crt0.s +++ b/libgcc/config/nvptx/crt0.s @@ -22,6 +22,9 @@ exit; } +.visible .shared .u64 __nvptx_stacks[1]; +.global .u64 %__softstack[16384]; + .extern .func (.param.u32 retval) main (.param.u32 argc, .param.u64 argv); .visible .entry __main (.param .u64 __retval, .param.u32 __argc, .param.u64 __argv) @@ -34,6 +37,12 @@ ld.param.u64 %rd0, [__retval]; st.global.u64 [%__exitval], %rd0; + .reg .u64 %stackptr; + mov.u64 %stackptr, %__softstack; + cvta.global.u64 %stackptr, %stackptr; + add.u64 %stackptr, %stackptr, 131072; + st.shared.u64 [__nvptx_stacks], %stackptr; + ld.param.u32 %r1, [__argc]; ld.param.u64 %rd1, [__argv]; st.param.u32 [%argc], %r1;