Message ID | 20200629201653.147326-5-julian@codesourcery.com |
---|---|
State | New |
Headers | show |
Series | [og10] amdgcn: Add waitcnt after LDS write instructions | expand |
On 29/06/2020 21:16, Julian Brown wrote: > Data-share write (ds_write) instructions do not necessarily complete > the write to LDS immediately. When a write completes, LGKM_CNT is > decremented. For now, we wait until LGKM_CNT reaches zero after each > ds_write instruction. > > This fixes a race condition in the case where LDS is read immediately > after being written. This can happen with broadcast operations. > > OK for og10 branch? I'm not saying no (because this issue needs a fix), but the thought occurs that inserting one wait before the barrier might be better than inserting a wait after each and every write. In particular, it seems logical that any barrier should be a memory barrier, so inserting it in the barrier pattern is not a big deal. IIRC, only OpenACC is using that anyway (OpenMP has explicit asm inserts in libgomp). WDYT? Andrew
On Mon, 29 Jun 2020 21:32:41 +0100 Andrew Stubbs <ams@codesourcery.com> wrote: > On 29/06/2020 21:16, Julian Brown wrote: > > Data-share write (ds_write) instructions do not necessarily complete > > the write to LDS immediately. When a write completes, LGKM_CNT is > > decremented. For now, we wait until LGKM_CNT reaches zero after each > > ds_write instruction. > > > > This fixes a race condition in the case where LDS is read > > immediately after being written. This can happen with broadcast > > operations. > > > > OK for og10 branch? > > I'm not saying no (because this issue needs a fix), but the thought > occurs that inserting one wait before the barrier might be better > than inserting a wait after each and every write. > > In particular, it seems logical that any barrier should be a memory > barrier, so inserting it in the barrier pattern is not a big deal. > IIRC, only OpenACC is using that anyway (OpenMP has explicit asm > inserts in libgomp). I'd be happier with that idea if ds_{read,write} operations were *only* used for broadcasting -- but they're not, they may also be used for (some) gang-private variables and for reduction temporaries. I don't have a test case for either of those at present demonstrating bad behaviour with no waitcnt, but I guess it's theoretically possible for there to be one, at least. The "proper" solution is a general (& "optimal") waitcnt insertion pass, I think, that works with other memory operations as well as the DS ones. Thanks, Julian
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md index 6d7fecaa12c2..9dfaec1d0645 100644 --- a/gcc/config/gcn/gcn-valu.md +++ b/gcc/config/gcn/gcn-valu.md @@ -923,7 +923,7 @@ { addr_space_t as = INTVAL (operands[3]); static char buf[200]; - sprintf (buf, "ds_write%%b2\t%%0, %%2 offset:%%1%s", + sprintf (buf, "ds_write%%b2\t%%0, %%2 offset:%%1%s\;s_waitcnt\tlgkmcnt(0)", (AS_GDS_P (as) ? " gds" : "")); return buf; } diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md index 8cfb3a85d256..e58669240c67 100644 --- a/gcc/config/gcn/gcn.md +++ b/gcc/config/gcn/gcn.md @@ -554,7 +554,7 @@ flat_load_dword\t%0, %A1%O1%g1\;s_waitcnt\t0 flat_store_dword\t%A0, %1%O0%g0 v_mov_b32\t%0, %1 - ds_write_b32\t%A0, %1%O0 + ds_write_b32\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0) ds_read_b32\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0) s_mov_b32\t%0, %1 global_load_dword\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0) @@ -582,7 +582,7 @@ flat_load%o1\t%0, %A1%O1%g1\;s_waitcnt\t0 flat_store%s0\t%A0, %1%O0%g0 v_mov_b32\t%0, %1 - ds_write%b0\t%A0, %1%O0 + ds_write%b0\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0) ds_read%u1\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0) global_load%o1\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0) global_store%s0\t%A0, %1%O0%g0" @@ -611,7 +611,7 @@ # flat_load_dwordx2\t%0, %A1%O1%g1\;s_waitcnt\t0 flat_store_dwordx2\t%A0, %1%O0%g0 - ds_write_b64\t%A0, %1%O0 + ds_write_b64\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0) ds_read_b64\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0) global_load_dwordx2\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0) global_store_dwordx2\t%A0, %1%O0%g0" @@ -667,7 +667,7 @@ # global_store_dwordx4\t%A0, %1%O0%g0 global_load_dwordx4\t%0, %A1%O1%g1\;s_waitcnt\tvmcnt(0) - ds_write_b128\t%A0, %1%O0 + ds_write_b128\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0) ds_read_b128\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)" "reload_completed && REG_P (operands[0])