Message ID | 20191003163505.49997-2-julian@codesourcery.com |
---|---|
State | New |
Headers | show |
Series | OpenACC reference count overhaul | expand |
Hi Julian! On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote: > This patch has been broken out of the patch supporting OpenACC 2.6 manual > deep copy last posted here: > > https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html Thanks. > a couple of > tests need fixing also Let's look at these first, and independently. The overall goal not being to bend test cases until they (again) work, but rather to verify what they're testing, so that they're valid OpenACC code, or if not that, then they're testing specifics of the GCC implementation (for example, the 'dg-shouldfail' test cases). > * testsuite/libgomp.oacc-c-c++-common/context-2.c: Use correct API to > deallocate acc_copyin'd data. > * testsuite/libgomp.oacc-c-c++-common/context-4.c: Likewise. > --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c > + acc_delete (&h_X[0], N * sizeof (float)); > + acc_delete (&h_Y1[0], N * sizeof (float)); > + > free (h_X); > free (h_Y1); > free (h_Y2); > > - acc_free (d_X); > - acc_free (d_Y); > --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c > + acc_delete (&h_X[0], N * sizeof (float)); > + acc_delete (&h_Y1[0], N * sizeof (float)); > + > free (h_X); > free (h_Y1); > free (h_Y2); > > - acc_free (d_X); > - acc_free (d_Y); ACK -- but do we understand why the same shouldn't be applied to the very similar 'libgomp.oacc-c-c++-common/context-1.c' and 'libgomp.oacc-c-c++-common/context-3.c', too? I suppose your testing of the "OpenACC reference count overhaul" tripped over these constructs? (Why just some, then?) The same pattern ('acc_copyin', 'acc_free') also appears in 'libgomp.oacc-c-c++-common/clauses-1.c', does that also need to be corrected? Same in 'libgomp.oacc-c-c++-common/lib-13.c' (... where that test case actually is titled "Check acc_is_present and acc_delete" instead of "[...] acc_free", huh), 'libgomp.oacc-c-c++-common/lib-14.c', 'libgomp.oacc-c-c++-common/lib-18.c'. Then, the 'acc_deviceptr', 'acc_unmap_data', 'acc_free' usage in 'libgomp.oacc-c-c++-common/clauses-1.c' also seems strange, as the respective 'acc_free' argument certainly is not (at least not directly) a "pointer value that was returned by a call to 'acc_malloc'". Does it make sense to (continue to) support that, assuming that's how it's implemented internally, or should these be corrected to valid OpenACC, too? Same in 'libgomp.oacc-c-c++-common/present-1.c'. Same in 'libgomp.oacc-c-c++-common/clauses-2.c' (we 'dg-shouldfail' earlier, but the later code should otherwise be made correct anyway). Several of these things again in 'libgomp.oacc-c-c++-common/nested-1.c'. (The other 'libgomp.oacc-c-c++-common/lib-*.c' ones are correctly pairing 'acc_malloc', 'acc_free', as far as I can tell.) > --- a/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 > +++ b/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 > @@ -70,10 +71,14 @@ program test > end do > !$acc end parallel > > - !$acc exit data copyout (d(1:N)) async > + !$acc exit data delete (c(1:N)) copyout (d(1:N)) async > !$acc exit data async > !$acc wait ACK, but also it seems to me as if the '!$acc exit data async' (currently "clause-less") was meant to carry the 'delete (c(1:N))' clause? > @@ -1,4 +1,5 @@ > ! { dg-do run } > +! { dg-additional-options "-cpp" } > [...] > +#if !ACC_MEM_SHARED > + if (acc_is_present (c) .eqv. .TRUE.) call abort > +#endif ;-) Should be able to simplify that one to 'if (acc_is_present (c))', no? But is that a really useful test here: don't we elsewhere have enough of such 'acc_is_present' testing? (That is, OK to keep that, but likewise OK to drop that.) And, just for background information: per PR84381, it has been suggested to use the Fortran standard 'stop' (or was it 'error stop'?) instead of 'call abort'. But no need to change that here individually; the libgomp testsuite still (or, again?) contains a lot of 'call abort'. > + > do i = 1, N > if (d(i) .ne. 4.0) call abort > end do ..., for example, here. ;-) (For avoidance of doubt, I'm not asking you to change these now.) So, please address these items first, as separate "Fix OpenACC test cases regarding 'acc_malloc', 'acc_free' pairing", and "Fix OpenACC test case for unstructured data regions" (or similar) commits. If you're confident you're doing "the obvious", feel free to commit without further review. Grüße Thomas
On Tue, 15 Oct 2019 17:30:06 +0200 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi Julian! > > On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> > wrote: > > This patch has been broken out of the patch supporting OpenACC 2.6 > > manual deep copy last posted here: > > > > https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html > > Thanks. > > > > a couple of > > tests need fixing also > > Let's look at these first, and independently. > > The overall goal not being to bend test cases until they (again) work, > but rather to verify what they're testing, so that they're valid > OpenACC code, or if not that, then they're testing specifics of the > GCC implementation (for example, the 'dg-shouldfail' test cases). Indeed, the tests looked "obviously wrong", but actually none of them should have failed with the reference-count overhaul patch. As far as I can tell, only the context-2.c test now fails with the current og9 branch, intermittently, with the last version of the patch sent. Turns out that was a real bug! So, good catch. > ACK -- but do we understand why the same shouldn't be applied to the > very similar 'libgomp.oacc-c-c++-common/context-1.c' and > 'libgomp.oacc-c-c++-common/context-3.c', too? > > I suppose your testing of the "OpenACC reference count overhaul" > tripped over these constructs? (Why just some, then?) Yeah. Just blind luck, I think. > The same pattern ('acc_copyin', 'acc_free') also appears in > 'libgomp.oacc-c-c++-common/clauses-1.c', does that also need to be > corrected? Same in 'libgomp.oacc-c-c++-common/lib-13.c' (... where > that test case actually is titled "Check acc_is_present and > acc_delete" instead of "[...] acc_free", huh), > 'libgomp.oacc-c-c++-common/lib-14.c', > 'libgomp.oacc-c-c++-common/lib-18.c'. > > Then, the 'acc_deviceptr', 'acc_unmap_data', 'acc_free' usage in > 'libgomp.oacc-c-c++-common/clauses-1.c' also seems strange, as the > respective 'acc_free' argument certainly is not (at least not > directly) a "pointer value that was returned by a call to > 'acc_malloc'". Does it make sense to (continue to) support that, > assuming that's how it's implemented internally, or should these be > corrected to valid OpenACC, too? Same in > 'libgomp.oacc-c-c++-common/present-1.c'. > > Same in 'libgomp.oacc-c-c++-common/clauses-2.c' (we 'dg-shouldfail' > earlier, but the later code should otherwise be made correct anyway). > > Several of these things again in > 'libgomp.oacc-c-c++-common/nested-1.c'. I'm not sure if *all* of those are wrong. I have a patch (forthcoming) that fixes some of the pedantically-wrong OpenACC usage, but none of the tests now regress with this version of the patch, so the urgency is gone. This version of the patch fixes the lookup_dev_1 helper function -- previously I had: static splay_tree_key lookup_dev_1 (splay_tree_node node, uintptr_t d, size_t s) { splay_tree_key k = &node->key; struct target_mem_desc *t = k->tgt; if (d >= t->tgt_start && d + s <= t->tgt_end) return k; if (node->left) return lookup_dev_1 (node->left, d, s); if (node->right) return lookup_dev_1 (node->right, d, s); return NULL; } which would never recurse into a right-hand branch if there was a left-hand node! Oops. So, device-address lookups would sometimes fail when there was a valid mapping, depending on the balance of the splay tree. (As an aside, I think calling lookup_dev unconditionally in several of the OpenACC API calls as we do is a bad idea -- it takes time linear to the number of mappings, with no way to avoid that overhead. But that's another matter.) Re-testing shows that the previously-regressing tests no longer regress, but I haven't yet made any changes to VREFCOUNT_LINK_KEY, etc. as suggested in the review of the attach/detach patch: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01374.html OK? (ChangeLog as before.) Julian
Hi! On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote: > This patch has been broken out of the patch supporting OpenACC 2.6 manual > deep copy last posted here: > > https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html Thanks. Remeber to look into <https://gcc.gnu.org/PR92116> "Potential null pointer dereference in 'gomp_acc_remove_pointer'", which may be relevant here. I see you've merged in the relevant parts of my incremental patch '[WIP] OpenACC 2.6 manual deep copy support (attach/detach): adjust for "goacc_async_unmap_tgt" removal', that I included in <http://mid.mail-archive.com/yxfpftuqpakv.fsf@hertz.schwinge.homeip.net>, which tells me that I supposedly understood that part alright. ;-D > As part of developing that patch, libgomp's OpenACC reference counting > implementation proved to be somewhat inconsistent, especially when > used in combination with the deep copy support which exercises it > more thoroughly. > > So, this patch contains just the changes to reference-counting behaviour, > for ease of (re-)review. The other parts of OpenACC 2.6 manual deep > copy support are forthcoming, but some changes in this patch anticipate > that support. As we're discussing these separately, please for now remove the changes related to the 'VREFCOUNT_LINK_KEY' toggle flag, and moving 'link_key' into an union (to later be shared with 'attach_count'); <http://mid.mail-archive.com/87pniuuhkj.fsf@euler.schwinge.homeip.net>. > Tested with offloading to NVPTX, with good results (though a couple of > tests need fixing also). The testsuite changes we're discussing separately, and need to go in before this one, obviously. > OK for trunk? I haven't understood all the changes related to replacing 'dynamic_refcount' with 'virtual_refcount', getting rid of 'data_environ', the 'lookup_dev' rework, but I trust you got that right. In particular, these seem to remove special-case OpenACC code in favor of generic OMP code, which is good. A few more comments: > --- a/libgomp/libgomp.h > +++ b/libgomp/libgomp.h > typedef struct acc_dispatch_t > { > - /* This is a linked list of data mapped using the > - acc_map_data/acc_unmap_data or "acc enter data"/"acc exit data" pragmas. > - Unlike mapped_data in the goacc_thread struct, unmapping can > - happen out-of-order with respect to mapping. */ > - /* This is guarded by the lock in the "outer" struct gomp_device_descr. */ > - struct target_mem_desc *data_environ; As mentioned before, please also accordingly update the comment attached to 'acc_dispatch_t openacc' in 'struct gomp_device_descr'. That code: > -/* Free address mapping tables. MM must be locked on entry, and remains locked > - on return. */ > - > -attribute_hidden void > -gomp_free_memmap (struct splay_tree_s *mem_map) > -{ > - while (mem_map->root) > - { > - struct target_mem_desc *tgt = mem_map->root->key.tgt; > - > - splay_tree_remove (mem_map, &mem_map->root->key); > - free (tgt->array); > - free (tgt); > - } > -} ... kind-of gets inlined here: > --- a/libgomp/oacc-init.c > +++ b/libgomp/oacc-init.c > @@ -356,9 +356,13 @@ acc_shutdown_1 (acc_device_t d) > > if (walk->dev) > { > - gomp_mutex_lock (&walk->dev->lock); > - gomp_free_memmap (&walk->dev->mem_map); > - gomp_mutex_unlock (&walk->dev->lock); > + while (walk->dev->mem_map.root) > + { > + splay_tree_key k = &walk->dev->mem_map.root->key; > + gomp_remove_var (walk->dev, k); > + } > > walk->dev = NULL; > walk->base_dev = NULL; It's not obvious to me why it's OK to remove the locking? Don't all operations on the 'mem_map' have to have the device locked? Does that code now still have the previous (and expected?) "finalize" semantics (don't consider 'refcount', always unmap)? (Should we assert here that 'gomp_remove_var' always returns 'true'? And/or, if it doesn't, what does that mean then?) Or am I confused? ;-) > --- a/libgomp/oacc-mem.c > +++ b/libgomp/oacc-mem.c > @@ -427,6 +418,7 @@ acc_unmap_data (void *h) > { > struct goacc_thread *thr = goacc_thread (); > struct gomp_device_descr *acc_dev = thr->dev; > + struct splay_tree_key_s cur_node; I know it's often not the case in existing code, but when adding new code, please move definitions next to their first use. > @@ -438,12 +430,11 @@ acc_unmap_data (void *h) > acc_api_info api_info; > bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info); > > gomp_mutex_lock (&acc_dev->lock); > > - splay_tree_key n = lookup_host (acc_dev, h, 1); > - struct target_mem_desc *t; > + cur_node.host_start = (uintptr_t) h; > + cur_node.host_end = cur_node.host_start + 1; > + splay_tree_key n = splay_tree_lookup (&acc_dev->mem_map, &cur_node); > > if (!n) > { Isn't this just inlining 'lookup_host'? There may be a good reason to do that, but what is it? > @@ -451,47 +442,28 @@ acc_unmap_data (void *h) > - /* Mark for removal. */ > - n->refcount = 1; > + splay_tree_remove (&acc_dev->mem_map, n); > > - t = n->tgt; > + struct target_mem_desc *tgt = n->tgt; > > - if (t->refcount == 2) > + if (tgt->refcount > 0) > + tgt->refcount--; > + else > { > -[...] > + free (tgt->array); > + free (tgt); > } Shouldn't that be 'if (tgt->refcount > 1)' (instead of '> 0'), like in 'gomp_unref_tgt' -- or actually use that function? > > gomp_mutex_unlock (&acc_dev->lock); > > - gomp_unmap_vars (t, true); > - > if (profiling_p) > { > thr->prof_info = NULL; Hmm, I don't understand the changes leading to this, but again, I shall trust that you've got that right. Or, was that a bug in the existing code, and we don't have proper test coverage? > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, size_t s, int async) > - d = tgt->to_free; > + n = lookup_host (acc_dev, h, s); > + assert (n != NULL); > + d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h > + - n->host_start); | return d; Again, it's not obvious to me how that is semantically equivalent to what we've returned before? > void > -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async, > - int finalize, int mapnum) > +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void **hostaddrs, > + size_t *sizes, unsigned short *kinds, int async, > + bool finalize, int mapnum) > { > + switch (kind) > + { > + case GOMP_MAP_FROM: > + case GOMP_MAP_FORCE_FROM: > + case GOMP_MAP_ALWAYS_FROM: > + copyfrom = true; > + /* Fallthrough. */ > + > + case GOMP_MAP_TO_PSET: > + case GOMP_MAP_POINTER: > + case GOMP_MAP_DELETE: > + case GOMP_MAP_RELEASE: > [...] > + default: > + gomp_mutex_unlock (&acc_dev->lock); > + gomp_fatal ("gomp_acc_remove_pointer unhandled kind 0x%.2x", > + kind); Thanks for being explicit about the expected mapping kinds, etc. > - /* If running synchronously, unmap immediately. */ > - if (async < acc_async_noval) > - gomp_unmap_vars (t, true); > - else > - { > - goacc_aq aq = get_goacc_asyncqueue (async); > - gomp_unmap_vars_async (t, true, aq); As mentioned before, 'gomp_acc_remove_pointer' now "has an unused 'async' formal parameter. Is that meant to be resolved to an asyncqueue, and pass that one to 'gomp_copy_dev2host', and call 'gomp_remove_var_async' instead of 'gomp_remove_var'"? That's here: > + if (copyfrom) > + gomp_copy_dev2host (acc_dev, NULL, (void *) cur_node.host_start, > + (void *) (n->tgt->tgt_start + n->tgt_offset > + + cur_node.host_start > + - n->host_start), > + cur_node.host_end - cur_node.host_start); > + > + if (n->refcount == 0) > + gomp_remove_var (acc_dev, n); > + break; > --- a/libgomp/oacc-parallel.c > +++ b/libgomp/oacc-parallel.c > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned short *kinds) I've always been confused by this function (before known as 'find_pset'); this feels wrong, but I've never gotten to the bottom of it. I'll trust that your changes there can only improve the current situation, not worsen it. ;-) And, again, thanks for being explicit about the expected mapping kinds, etc. > @@ -745,8 +762,14 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum, > } > else > { > - gomp_acc_insert_pointer (pointer, &hostaddrs[i], > - &sizes[i], &kinds[i], async); > + goacc_aq aq = get_goacc_asyncqueue (async); > + for (int j = 0; j < 2; j++) Should this magic constant '2' be derived from 'pointer' or some such? > + gomp_map_vars_async (acc_dev, aq, > + (j == 0 || pointer == 2) ? 1 : 2, > + &hostaddrs[i + j], NULL, > + &sizes[i + j], &kinds[i + j], true, > + GOMP_MAP_VARS_OPENACC_ENTER_DATA); ;-) Yuck. As requested before: "Can we get a comment added to such 'magic', please?" I just wish that eventually we'll be able to can get rid of that stuff, and just let 'gomp_map_vars' do its thing. Similar to <https://gcc.gnu.org/PR90596> "'GOACC_parallel_keyed' should use 'GOMP_MAP_VARS_TARGET'". (For avoidance of doubt, that's not your task right now.) > --- a/libgomp/target.c > +++ b/libgomp/target.c > @@ -536,7 +536,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, > struct target_mem_desc *tgt > = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum); > tgt->list_count = mapnum; > - tgt->refcount = pragma_kind == GOMP_MAP_VARS_ENTER_DATA ? 0 : 1; > + tgt->refcount = (pragma_kind == GOMP_MAP_VARS_ENTER_DATA > + || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) ? 0 : 1; > tgt->device_descr = devicep; > struct gomp_coalesce_buf cbuf, *cbufp = NULL; > > @@ -1051,8 +1053,20 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, > /* If the variable from "omp target enter data" map-list was already mapped, > tgt is not needed. Otherwise tgt will be freed by gomp_unmap_vars or > gomp_exit_data. */ > - if (pragma_kind == GOMP_MAP_VARS_ENTER_DATA && tgt->refcount == 0) > - { > + if ((pragma_kind == GOMP_MAP_VARS_ENTER_DATA > + || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) > + && tgt->refcount == 0) > + { > + /* If we're about to discard a target_mem_desc with no "structural" > + references (tgt->refcount == 0), any splay keys linked in the tgt's > + list must have their virtual refcount incremented to represent that > + "lost" reference in order to implement the semantics of the OpenACC > + "present increment" operation properly. */ > + if (pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) > + for (i = 0; i < tgt->list_count; i++) > + if (tgt->list[i].key) > + tgt->list[i].key->virtual_refcount++; > + > free (tgt); > tgt = NULL; > } So that last item is the only difference between 'GOMP_MAP_VARS_ENTER_DATA' and 'GOMP_MAP_VARS_OPENACC_ENTER_DATA'. Again I have not digested that one, but will trust you. > @@ -1310,7 +1366,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, > k->tgt = tgt; > k->tgt_offset = target_table[i].start; > k->refcount = REFCOUNT_INFINITY; > - k->link_key = NULL; > + k->virtual_refcount = 0; > array->left = NULL; > array->right = NULL; > splay_tree_insert (&devicep->mem_map, array); > @@ -1342,7 +1398,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, > k->tgt = tgt; > k->tgt_offset = target_var->start; > k->refcount = target_size & link_bit ? REFCOUNT_LINK : REFCOUNT_INFINITY; > - k->link_key = NULL; > + k->virtual_refcount = 0; > array->left = NULL; > array->right = NULL; > splay_tree_insert (&devicep->mem_map, array); Why no longer initialize 'link_key' here? I'd expect that always all fields of 'k' ('struct splay_tree_key_s') get initialized, so like: > @@ -2612,6 +2652,8 @@ omp_target_associate_ptr (const void *host_ptr, const void *device_ptr, > k->tgt = tgt; > k->tgt_offset = (uintptr_t) device_ptr + device_offset; > k->refcount = REFCOUNT_INFINITY; > + k->virtual_refcount = 0; > + k->u.link_key = NULL; > array->left = NULL; > array->right = NULL; > splay_tree_insert (&devicep->mem_map, array); (I haven't verified whether that's always done, please verify.) Grüße Thomas
Hi Julian! On 2019-10-21T16:14:11+0200, I wrote: > On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote: >> This patch has been broken out of the patch supporting OpenACC 2.6 manual >> deep copy last posted here: >> >> https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html > > Thanks. I meanwhile re-discovered that an earlier submission, <http://mid.mail-archive.com/cover.1543438190.git.julian@codesourcery.com>, had included some documentation/rationale for: > I haven't understood all the changes related to replacing > 'dynamic_refcount' with 'virtual_refcount', getting rid of > 'data_environ', the 'lookup_dev' rework, but I trust you got that right. > In particular, these seem to remove special-case OpenACC code in favor of > generic OMP code, which is good. ... these changes. Please in the future remember to refer to such existing documentation/rationale, or again include in any re-submissions, thanks. >> Tested with offloading to NVPTX, with good results I noticed that when testing with '-foffload=x86_64-intelmicemul-linux-gnu', the x86_64-pc-linux-gnu '-m32' multilib (but not default '-m64', huh) then reproducibly regresses: PASS: libgomp.c/target-link-1.c (test for excess errors) [-PASS:-]{+FAIL:+} libgomp.c/target-link-1.c execution test ..., with an un-helpful message: "offload error: process on the device 0 unexpectedly exited with code 0". So non-OpenACC code paths seem to be negatively affected in some way? Hopefully that'll go away when backing out the 'VREFCOUNT_LINK_KEY' etc. changes, as discussed elsewhere. (I can easily test patches for you, no need for you to set up Intel MIC (emulated) offloading testing.) Grüße Thomas
Hi! This is a new version of the patch which hopefully addresses all review comments. Further commentary below. On Mon, 21 Oct 2019 16:14:11 +0200 Thomas Schwinge <thomas@codesourcery.com> wrote: > On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> > wrote: > > This patch has been broken out of the patch supporting OpenACC 2.6 > > manual deep copy last posted here: > > > > https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html > > Thanks. > > Remeber to look into <https://gcc.gnu.org/PR92116> "Potential null > pointer dereference in 'gomp_acc_remove_pointer'", which may be > relevant here. I've deleted the whole function (see below) so nothing to do there, I don't think, even if that code had still been live in the last version of the patch. > I see you've merged in the relevant parts of my incremental patch > '[WIP] OpenACC 2.6 manual deep copy support (attach/detach): adjust > for "goacc_async_unmap_tgt" removal', that I included in > <http://mid.mail-archive.com/yxfpftuqpakv.fsf@hertz.schwinge.homeip.net>, > which tells me that I supposedly understood that part alright. ;-D Yes I think so -- I'll add you as co-author to the ChangeLog. Apologies for the omission! > > As part of developing that patch, libgomp's OpenACC reference > > counting implementation proved to be somewhat inconsistent, > > especially when used in combination with the deep copy support > > which exercises it more thoroughly. > > > > So, this patch contains just the changes to reference-counting > > behaviour, for ease of (re-)review. The other parts of OpenACC 2.6 > > manual deep copy support are forthcoming, but some changes in this > > patch anticipate that support. > > As we're discussing these separately, please for now remove the > changes related to the 'VREFCOUNT_LINK_KEY' toggle flag, and moving > 'link_key' into an union (to later be shared with 'attach_count'); > <http://mid.mail-archive.com/87pniuuhkj.fsf@euler.schwinge.homeip.net>. Done (I have a plan for the link_key/attach_count fields, but it's not in this patch, and I'm not sure how well it'll work out yet). > > Tested with offloading to NVPTX, with good results (though a couple > > of tests need fixing also). > > The testsuite changes we're discussing separately, and need to go in > before this one, obviously. Those tests no longer regress, so no testsuite changes are strictly necessary for this patch. > > OK for trunk? > > I haven't understood all the changes related to replacing > 'dynamic_refcount' with 'virtual_refcount', getting rid of > 'data_environ', the 'lookup_dev' rework, but I trust you got that > right. In particular, these seem to remove special-case OpenACC code > in favor of generic OMP code, which is good. Yep -- the previous email you dug up included the following rationale: - reference counts in the linked memory-mapping splay tree structure can be self-checked for consistency using optional (i.e. development-only) code. This survives a libgomp test run (with offloading to nvptx), so I'm reasonably confident it's good. - the "data_environ" field in the device descriptor -- a linear linked list containing a target memory descriptor for each "acc enter data" mapping -- has been removed. This brings OpenACC closer to the OpenMP implementation for non-lexically-scoped data mapping (GOMP_target_enter_exit_data), and is potentially a performance win if lots of data is mapped in this way. - the semantics of the "dynamic_refcount" field in the splay_tree_key structure have shifted slightly, so I've renamed the field. It now represents references that are excess to those represented by actual pointers in the linked splay tree/target-memory descriptor structure. That might have been the intention before in fact, but the implementation was inconsistent. The big thing here is the auto-checking of refcounting behaviour. There were quite a few corner cases that were broken before. > A few more comments: > > > --- a/libgomp/libgomp.h > > +++ b/libgomp/libgomp.h > > > typedef struct acc_dispatch_t > > { > > - /* This is a linked list of data mapped using the > > - acc_map_data/acc_unmap_data or "acc enter data"/"acc exit > > data" pragmas. > > - Unlike mapped_data in the goacc_thread struct, unmapping can > > - happen out-of-order with respect to mapping. */ > > - /* This is guarded by the lock in the "outer" struct > > gomp_device_descr. */ > > - struct target_mem_desc *data_environ; > > As mentioned before, please also accordingly update the comment > attached to 'acc_dispatch_t openacc' in 'struct gomp_device_descr'. Done. > That code: > > > -/* Free address mapping tables. MM must be locked on entry, and > > remains locked > > - on return. */ > > - > > -attribute_hidden void > > -gomp_free_memmap (struct splay_tree_s *mem_map) > > -{ > > - while (mem_map->root) > > - { > > - struct target_mem_desc *tgt = mem_map->root->key.tgt; > > - > > - splay_tree_remove (mem_map, &mem_map->root->key); > > - free (tgt->array); > > - free (tgt); > > - } > > -} > > ... kind-of gets inlined here: > > > --- a/libgomp/oacc-init.c > > +++ b/libgomp/oacc-init.c > > @@ -356,9 +356,13 @@ acc_shutdown_1 (acc_device_t d) > > > > if (walk->dev) > > { > > - gomp_mutex_lock (&walk->dev->lock); > > - gomp_free_memmap (&walk->dev->mem_map); > > - gomp_mutex_unlock (&walk->dev->lock); > > + while (walk->dev->mem_map.root) > > + { > > + splay_tree_key k = &walk->dev->mem_map.root->key; > > + gomp_remove_var (walk->dev, k); > > + } > > > > walk->dev = NULL; > > walk->base_dev = NULL; > > It's not obvious to me why it's OK to remove the locking? Don't all > operations on the 'mem_map' have to have the device locked? You're probably right about this -- good catch. Although if the user is shutting down the device whilst it is still active (from some other thread?) it's just a case of how ugly their crash is going to be either way, I suspect! > Does that code now still have the previous (and expected?) "finalize" > semantics (don't consider 'refcount', always unmap)? (Should we > assert here that 'gomp_remove_var' always returns 'true'? And/or, if > it doesn't, what does that mean then?) Or am I confused? ;-) Yeah. The splay tree keys are removed one at a time (without paying attention to the refcounts for those), and the linked target_mem_descs are freed when their refcounts drop to zero. Hence is_tgt_unmapped won't always be true -- only when one of the linked target_mem_descs gets freed. > > --- a/libgomp/oacc-mem.c > > +++ b/libgomp/oacc-mem.c > > > @@ -427,6 +418,7 @@ acc_unmap_data (void *h) > > { > > struct goacc_thread *thr = goacc_thread (); > > struct gomp_device_descr *acc_dev = thr->dev; > > + struct splay_tree_key_s cur_node; > > I know it's often not the case in existing code, but when adding new > code, please move definitions next to their first use. Done. > > @@ -438,12 +430,11 @@ acc_unmap_data (void *h) > > acc_api_info api_info; > > bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, > > &api_info); > > gomp_mutex_lock (&acc_dev->lock); > > > > - splay_tree_key n = lookup_host (acc_dev, h, 1); > > - struct target_mem_desc *t; > > + cur_node.host_start = (uintptr_t) h; > > + cur_node.host_end = cur_node.host_start + 1; > > + splay_tree_key n = splay_tree_lookup (&acc_dev->mem_map, > > &cur_node); > > if (!n) > > { > > Isn't this just inlining 'lookup_host'? There may be a good reason > to do that, but what is it? Yeah, looks like it. I changed the code to use lookup_host. > > @@ -451,47 +442,28 @@ acc_unmap_data (void *h) > > > - /* Mark for removal. */ > > - n->refcount = 1; > > + splay_tree_remove (&acc_dev->mem_map, n); > > > > - t = n->tgt; > > + struct target_mem_desc *tgt = n->tgt; > > > > - if (t->refcount == 2) > > + if (tgt->refcount > 0) > > + tgt->refcount--; > > + else > > { > > -[...] > > + free (tgt->array); > > + free (tgt); > > } > > Shouldn't that be 'if (tgt->refcount > 1)' (instead of '> 0'), like in > 'gomp_unref_tgt' -- or actually use that function? I think you're right about the condition -- well spotted! We can't use gomp_unref_tgt here because acc_unmap_data isn't supposed to free the device memory. > > > > gomp_mutex_unlock (&acc_dev->lock); > > > > - gomp_unmap_vars (t, true); > > - > > if (profiling_p) > > { > > thr->prof_info = NULL; > > Hmm, I don't understand the changes leading to this, but again, I > shall trust that you've got that right. > > Or, was that a bug in the existing code, and we don't have proper test > coverage? I think that was a bug in the original code. > > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, > > size_t s, int async) > > > - d = tgt->to_free; > > > + n = lookup_host (acc_dev, h, s); > > + assert (n != NULL); > > + d = (void *) (n->tgt->tgt_start + n->tgt_offset + > > (uintptr_t) h > > + - n->host_start); > > | return d; > > Again, it's not obvious to me how that is semantically equivalent to > what we've returned before? This is a bug fix (it's mentioned in the ChangeLog). > > void > > -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, > > int async, > > - int finalize, int mapnum) > > +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void > > **hostaddrs, > > + size_t *sizes, unsigned short *kinds, int > > async, > > + bool finalize, int mapnum) > > { > > > + switch (kind) > > + { > > + case GOMP_MAP_FROM: > > + case GOMP_MAP_FORCE_FROM: > > + case GOMP_MAP_ALWAYS_FROM: > > + copyfrom = true; > > + /* Fallthrough. */ > > + > > + case GOMP_MAP_TO_PSET: > > + case GOMP_MAP_POINTER: > > + case GOMP_MAP_DELETE: > > + case GOMP_MAP_RELEASE: > > [...] > > + default: > > + gomp_mutex_unlock (&acc_dev->lock); > > + gomp_fatal ("gomp_acc_remove_pointer unhandled kind > > 0x%.2x", > > + kind); > > Thanks for being explicit about the expected mapping kinds, etc. That code's all gone with this version... > > - /* If running synchronously, unmap immediately. */ > > - if (async < acc_async_noval) > > - gomp_unmap_vars (t, true); > > - else > > - { > > - goacc_aq aq = get_goacc_asyncqueue (async); > > - gomp_unmap_vars_async (t, true, aq); > > As mentioned before, 'gomp_acc_remove_pointer' now "has an unused > 'async' formal parameter. Is that meant to be resolved to an > asyncqueue, and pass that one to 'gomp_copy_dev2host', and call > 'gomp_remove_var_async' instead of 'gomp_remove_var'"? That's here: Hmm yeah, that's all gone however. > > + if (copyfrom) > > + gomp_copy_dev2host (acc_dev, NULL, (void *) > > cur_node.host_start, > > + (void *) (n->tgt->tgt_start + > > n->tgt_offset > > + + cur_node.host_start > > + - n->host_start), > > + cur_node.host_end - > > cur_node.host_start); + > > + if (n->refcount == 0) > > + gomp_remove_var (acc_dev, n); > > + break; > > > --- a/libgomp/oacc-parallel.c > > +++ b/libgomp/oacc-parallel.c > > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned > > short *kinds) > > I've always been confused by this function (before known as > 'find_pset'); this feels wrong, but I've never gotten to the bottom > of it. This version removes that function in favour of a function that finds groups of consecutive mappings that should be kept together for a single gomp_map_vars invocation. I think that fits better with my findings as written up on the wiki page https://gcc.gnu.org/wiki/LibgompPointerMappingKinds. > I'll trust that your changes there can only improve the current > situation, not worsen it. ;-) > > And, again, thanks for being explicit about the expected mapping > kinds, etc. > > > @@ -745,8 +762,14 @@ GOACC_enter_exit_data (int flags_m, size_t > > mapnum, } > > else > > { > > - gomp_acc_insert_pointer (pointer, &hostaddrs[i], > > - &sizes[i], &kinds[i], > > async); > > + goacc_aq aq = get_goacc_asyncqueue (async); > > + for (int j = 0; j < 2; j++) > > Should this magic constant '2' be derived from 'pointer' or some such? > > > + gomp_map_vars_async (acc_dev, aq, > > + (j == 0 || pointer == 2) ? > > 1 : 2, > > + &hostaddrs[i + j], NULL, > > + &sizes[i + j], &kinds[i + j], > > true, > > + > > GOMP_MAP_VARS_OPENACC_ENTER_DATA); > > ;-) Yuck. As requested before: "Can we get a comment added to such > 'magic', please?" That magic is gone now. > I just wish that eventually we'll be able to can get rid of that > stuff, and just let 'gomp_map_vars' do its thing. Similar to > <https://gcc.gnu.org/PR90596> "'GOACC_parallel_keyed' should use > 'GOMP_MAP_VARS_TARGET'". > > (For avoidance of doubt, that's not your task right now.) Does this version look better? I've removed the special-case handling of pointers in the enter/exit data code, and combined the gomp_acc_remove_pointer code (which now iterated over mappings one-at-a-time anyway) with the loop iterating over mappings in the new goacc_exit_data_internal function. It was a bit nonsensical to have the "exit data" code split over two files, as before. > > --- a/libgomp/target.c > > +++ b/libgomp/target.c > > @@ -536,7 +536,8 @@ gomp_map_vars_internal (struct > > gomp_device_descr *devicep, struct target_mem_desc *tgt > > = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum); > > tgt->list_count = mapnum; > > - tgt->refcount = pragma_kind == GOMP_MAP_VARS_ENTER_DATA ? 0 : 1; > > + tgt->refcount = (pragma_kind == GOMP_MAP_VARS_ENTER_DATA > > + || pragma_kind == > > GOMP_MAP_VARS_OPENACC_ENTER_DATA) ? 0 : 1; tgt->device_descr = > > devicep; struct gomp_coalesce_buf cbuf, *cbufp = NULL; > > > > @@ -1051,8 +1053,20 @@ gomp_map_vars_internal (struct > > gomp_device_descr *devicep, /* If the variable from "omp target > > enter data" map-list was already mapped, tgt is not needed. > > Otherwise tgt will be freed by gomp_unmap_vars or gomp_exit_data. > > */ > > - if (pragma_kind == GOMP_MAP_VARS_ENTER_DATA && tgt->refcount == > > 0) > > - { > > + if ((pragma_kind == GOMP_MAP_VARS_ENTER_DATA > > + || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) > > + && tgt->refcount == 0) > > + { > > + /* If we're about to discard a target_mem_desc with no > > "structural" > > + references (tgt->refcount == 0), any splay keys linked in > > the tgt's > > + list must have their virtual refcount incremented to > > represent that > > + "lost" reference in order to implement the semantics of > > the OpenACC > > + "present increment" operation properly. */ > > + if (pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) > > + for (i = 0; i < tgt->list_count; i++) > > + if (tgt->list[i].key) > > + tgt->list[i].key->virtual_refcount++; > > + > > free (tgt); > > tgt = NULL; > > } > > So that last item is the only difference between > 'GOMP_MAP_VARS_ENTER_DATA' and 'GOMP_MAP_VARS_OPENACC_ENTER_DATA'. > Again I have not digested that one, but will trust you. Yeah, because of the OpenACC reference counting & finalize semantics, which I don't think are applicable to OpenMP. > > @@ -1310,7 +1366,7 @@ gomp_load_image_to_device (struct > > gomp_device_descr *devicep, unsigned version, k->tgt = tgt; > > k->tgt_offset = target_table[i].start; > > k->refcount = REFCOUNT_INFINITY; > > - k->link_key = NULL; > > + k->virtual_refcount = 0; > > array->left = NULL; > > array->right = NULL; > > splay_tree_insert (&devicep->mem_map, array); > > @@ -1342,7 +1398,7 @@ gomp_load_image_to_device (struct > > gomp_device_descr *devicep, unsigned version, k->tgt = tgt; > > k->tgt_offset = target_var->start; > > k->refcount = target_size & link_bit ? REFCOUNT_LINK : > > REFCOUNT_INFINITY; > > - k->link_key = NULL; > > + k->virtual_refcount = 0; > > array->left = NULL; > > array->right = NULL; > > splay_tree_insert (&devicep->mem_map, array); > > Why no longer initialize 'link_key' here? > > I'd expect that always all fields of 'k' ('struct splay_tree_key_s') > get initialized, so like: > > > @@ -2612,6 +2652,8 @@ omp_target_associate_ptr (const void > > *host_ptr, const void *device_ptr, k->tgt = tgt; > > k->tgt_offset = (uintptr_t) device_ptr + device_offset; > > k->refcount = REFCOUNT_INFINITY; > > + k->virtual_refcount = 0; > > + k->u.link_key = NULL; > > array->left = NULL; > > array->right = NULL; > > splay_tree_insert (&devicep->mem_map, array); > > (I haven't verified whether that's always done, please verify.) This version (without the link_key union, etc.) should avoid those problems. I've added some missing initialisations, too. Re-tested with offloading to nvptx. OK for trunk? Thanks, Julian
Hi Julian! On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote: > This is a new version of the patch which hopefully addresses all review > comments. Further commentary below. Thanks, great, looking into that one -- I see you're removing more and more special-case, strange code, replacing it with generic and/or well-explained code. Question, for my understanding: > On Mon, 21 Oct 2019 16:14:11 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> >> wrote: >> > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, size_t s, int async) >> >> > - d = tgt->to_free; >> >> > + n = lookup_host (acc_dev, h, s); >> > + assert (n != NULL); >> > + d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h >> > + - n->host_start); >> >> | return d; >> >> Again, it's not obvious to me how that is semantically equivalent to >> what we've returned before? > > This is a bug fix (it's mentioned in the ChangeLog). Eh, well hidden. Indeed that mentions: (present_create_copy): [...] Fix target pointer return value. So that's not related to reference counting, needs to be discussed separately. ..., and while I do agree that the current code is a bit "strange" (returning 'tgt->to_free'), I couldn't quickly find or come up with a test cases where this would actually do the wrong thing. After all, this is the code path taken for "not present", and 'tgt' is built anew for one single mapping, with no alignment set (which would cause 'to_free' to differ from 'tgt_start'); 'tgt_offset' should always be zero, and 'h' always the same as 'host_start'. What am I missing? That is, given the current set of libgomp test cases, the attached never triggeres. Grüße Thomas
On Thu, 31 Oct 2019 19:11:57 +0100 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi Julian! > > On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> > wrote: > > This is a new version of the patch which hopefully addresses all > > review comments. Further commentary below. > > Thanks, great, looking into that one -- I see you're removing more and > more special-case, strange code, replacing it with generic and/or > well-explained code. > > > Question, for my understanding: > > > On Mon, 21 Oct 2019 16:14:11 +0200 > > Thomas Schwinge <thomas@codesourcery.com> wrote: > >> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> > >> wrote: > > >> > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, > >> > size_t s, int async) > >> > >> > - d = tgt->to_free; > >> > >> > + n = lookup_host (acc_dev, h, s); > >> > + assert (n != NULL); > >> > + d = (void *) (n->tgt->tgt_start + n->tgt_offset + > >> > (uintptr_t) h > >> > + - n->host_start); > >> > >> | return d; > >> > >> Again, it's not obvious to me how that is semantically equivalent > >> to what we've returned before? > > > > This is a bug fix (it's mentioned in the ChangeLog). > > Eh, well hidden. Indeed that mentions: > > (present_create_copy): [...] Fix target pointer > return value. > > So that's not related to reference counting, needs to be discussed > separately. > > ..., and while I do agree that the current code is a bit "strange" > (returning 'tgt->to_free'), I couldn't quickly find or come up with a > test cases where this would actually do the wrong thing. After all, > this is the code path taken for "not present", and 'tgt' is built > anew for one single mapping, with no alignment set (which would cause > 'to_free' to differ from 'tgt_start'); 'tgt_offset' should always be > zero, and 'h' always the same as 'host_start'. What am I missing? > That is, given the current set of libgomp test cases, the attached > never triggeres. The code can't stay exactly as it is with this patch, because the tgt return value from gomp_map_vars_async with GOMP_MAP_VARS_OPENACC_ENTER_DATA is a null pointer. So, the device pointer calculation needed to be re-done -- although it's not quite a bug fix, as you point out, and some of the offsets will always be zero or cancel out in practice. *However*, it looks like the device pointer calculation for the "present" case is wrong in the preceding code. I've addressed that in the patch posted here: https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00661.html The patch attached here applies on top of that one, and attempts to keep the device pointer calculation "the same" for the non-present case, modulo an extra lookup_host -- and also adds some assertions to make sure the assumptions about zero/cancelled-out offsets stay true. OK for trunk? Re-tested with offloading to nvptx. Thanks, Julian
On Sat, 9 Nov 2019 01:28:51 +0000 Julian Brown <julian@codesourcery.com> wrote: > On Thu, 31 Oct 2019 19:11:57 +0100 > Thomas Schwinge <thomas@codesourcery.com> wrote: > > > So that's not related to reference counting, needs to be discussed > > separately. > > > > ..., and while I do agree that the current code is a bit "strange" > > (returning 'tgt->to_free'), I couldn't quickly find or come up with > > a test cases where this would actually do the wrong thing. After > > all, this is the code path taken for "not present", and 'tgt' is > > built anew for one single mapping, with no alignment set (which > > would cause 'to_free' to differ from 'tgt_start'); 'tgt_offset' > > should always be zero, and 'h' always the same as 'host_start'. > > What am I missing? That is, given the current set of libgomp test > > cases, the attached never triggeres. > > The code can't stay exactly as it is with this patch, because the tgt > return value from gomp_map_vars_async with > GOMP_MAP_VARS_OPENACC_ENTER_DATA is a null pointer. > > So, the device pointer calculation needed to be re-done -- although > it's not quite a bug fix, as you point out, and some of the offsets > will always be zero or cancel out in practice. > > *However*, it looks like the device pointer calculation for the > "present" case is wrong in the preceding code. I've addressed that in > the patch posted here: > > https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00661.html > > The patch attached here applies on top of that one, and attempts to > keep the device pointer calculation "the same" for the non-present > case, modulo an extra lookup_host -- and also adds some assertions to > make sure the assumptions about zero/cancelled-out offsets stay true. Here's another iteration that applies over the version of the present/subarray patch committed, and also addresses the use of REFCOUNT_INFINITY on target blocks as queried in the following message: https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01146.html Most uses of REFCOUNT_INFINITY indeed appear to be unreachable (as in, a target_mem_desc with refcount == REFCOUNT_INFINITY will most of the time be linked from a splay tree key with refcount == REFCOUNT_INFINITY, and the code to decrement the former's refcount and/or free the block will never be called). I found one case (for OpenACC) where a runtime check/error can be added -- attempting to free a mapped target block corresponding to a device_resident global variable using an API routine. I don't think there's a code path using directives (for either OpenACC or OpenMP) that exhibits any problematic behaviour in that regard. I've added a couple of test cases, and a couple of assertions. OK now? (Or perhaps the REFCOUNT_INFINITY bits want splitting out? It all still arguably comes under the "refcount overhaul" umbrella!). Thanks, Julian
Hi Julian! On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote: > --- a/libgomp/oacc-mem.c > +++ b/libgomp/oacc-mem.c > @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname) > if (f & FLAG_COPYOUT) > [...] > gomp_copy_dev2host (acc_dev, aq, h, d, s); > } > - gomp_remove_var (acc_dev, n); > + gomp_remove_var_async (acc_dev, n, aq); Conceptually, I understand correctly that we need to use this (new) 'gomp_remove_var_async' to make sure that we don't 'gomp_free_device_memory' while the 'gomp_copy_dev2host' cited above is still in process? I'm curious why this isn't causing any problems for nvptx offloading already, any thoughts on that? Or, is this just missing test coverage? (Always difficult for 'async' stuff, of course.) By chance, is this right now already causing problems with AMD GCN offloading? (I really need to set up AMD GCN offloading testing...) I'm citing below the changes introducing 'gomp_remove_var_async', modelled similar to the existing 'gomp_unmap_vars_async'. Also for both these, do I understand correctly, that it's actually not the 'gomp_unref_tgt' that needs to be "delayed" via 'goacc_asyncqueue', but rather really only the 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via 'gomp_unref_tgt'? In other words: why do we need to keep the 'struct target_mem_desc' alive? Per my understanding, that one is one component of the mapping table, and not relevant anymore (thus can be 'free'd) as soon as it has been determined that 'tgt->refcount == 0'? Am I missing something there? It will be OK to clean that up later, but I'd like to understand this now. Well, or, stating that you just blindly copied that from the existing 'gomp_unmap_vars_async' is fine, too! ;-P Grüße Thomas > --- a/libgomp/target.c > +++ b/libgomp/target.c > @@ -1092,32 +1106,66 @@ gomp_unmap_tgt (struct target_mem_desc *tgt) > free (tgt); > } > > -attribute_hidden bool > -gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k) > +static bool > +gomp_unref_tgt (void *ptr) > { > bool is_tgt_unmapped = false; > - splay_tree_remove (&devicep->mem_map, k); > - if (k->link_key) > - splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->link_key); > - if (k->tgt->refcount > 1) > - k->tgt->refcount--; > + > + struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; > + > + if (tgt->refcount > 1) > + tgt->refcount--; > else > { > + gomp_unmap_tgt (tgt); > is_tgt_unmapped = true; > - gomp_unmap_tgt (k->tgt); > } > + > return is_tgt_unmapped; > } > > static void > -gomp_unref_tgt (void *ptr) > +gomp_unref_tgt_void (void *ptr) > { > - struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; > + (void) gomp_unref_tgt (ptr); > +} > > - if (tgt->refcount > 1) > - tgt->refcount--; > +static inline __attribute__((always_inline)) bool > +gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k, > + struct goacc_asyncqueue *aq) > +{ > + bool is_tgt_unmapped = false; > + splay_tree_remove (&devicep->mem_map, k); > + if (k->virtual_refcount == VREFCOUNT_LINK_KEY) > + { > + if (k->u.link_key) > + splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->u.link_key); > + } > + if (aq) > + devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void, > + (void *) k->tgt); > else > - gomp_unmap_tgt (tgt); > + is_tgt_unmapped = gomp_unref_tgt ((void *) k->tgt); > + return is_tgt_unmapped; > +} > + > +attribute_hidden bool > +gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k) > +{ > + return gomp_remove_var_internal (devicep, k, NULL); > +} > + > +/* Remove a variable asynchronously. This actually removes the variable > + mapping immediately, but retains the linked target_mem_desc until the > + asynchronous operation has completed (as it may still refer to target > + memory). The device lock must be held before entry, and remains locked on > + exit. */ > + > +attribute_hidden void > +gomp_remove_var_async (struct gomp_device_descr *devicep, splay_tree_key k, > + struct goacc_asyncqueue *aq) > +{ > + (void) gomp_remove_var_internal (devicep, k, aq); > }
On Mon, 9 Dec 2019 15:44:25 +0100 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi Julian! > > On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> > wrote: > > --- a/libgomp/oacc-mem.c > > +++ b/libgomp/oacc-mem.c > > > @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t > > s, int async, const char *libfnname) > > > if (f & FLAG_COPYOUT) > > [...] > > gomp_copy_dev2host (acc_dev, aq, h, d, s); > > } > > - gomp_remove_var (acc_dev, n); > > + gomp_remove_var_async (acc_dev, n, aq); > > Conceptually, I understand correctly that we need to use this (new) > 'gomp_remove_var_async' to make sure that we don't > 'gomp_free_device_memory' while the 'gomp_copy_dev2host' cited above > is still in process? Yep. > I'm curious why this isn't causing any problems for nvptx offloading > already, any thoughts on that? Or, is this just missing test > coverage? (Always difficult for 'async' stuff, of course.) By > chance, is this right now already causing problems with AMD GCN > offloading? (I really need to set up AMD GCN offloading testing...) In a few cases, async stuff on nvidia seems to "just work" even in cases where we wouldn't expect it to via inspection (either because the driver/hardware is doing something "magic", or because we're somehow driving async operations in such a way that they run synchronously in practice). One such case is with the "ephemeral" asynchronous host-to-device memory copy patch. The AMD side seems much more sensitive to improper async behaviour -- but I don't actually remember if I hit problems with this code in particular. > I'm citing below the changes introducing 'gomp_remove_var_async', > modelled similar to the existing 'gomp_unmap_vars_async'. > > > Also for both these, do I understand correctly, that it's actually not > the 'gomp_unref_tgt' that needs to be "delayed" via > 'goacc_asyncqueue', but rather really only the > 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via > 'gomp_unref_tgt'? In other words: why do we need to keep the 'struct > target_mem_desc' alive? Per my understanding, that one is one > component of the mapping table, and not relevant anymore (thus can be > 'free'd) as soon as it has been determined that 'tgt->refcount == > 0'? Am I missing something there? IIRC, that was Chung-Lin's choice. I'll CC him in. I think delaying freeing of the target_mem_desc isn't really a huge problem, in practice. > It will be OK to clean that up later, but I'd like to understand this > now. Well, or, stating that you just blindly copied that from the > existing 'gomp_unmap_vars_async' is fine, too! ;-P Some changes arose via the porting to AMD GCN, and some may have been drive-by fixes (e.g. where a synchronous call was used in a context where it is obvious that an asynchronous call is really needed). Like you mentioned, test coverage could probably be better, and writing reliable tests for async behaviour is challenging. Julian
Hi Julian! On 2019-12-09T15:04:15+0000, Julian Brown <julian@codesourcery.com> wrote: > On Mon, 9 Dec 2019 15:44:25 +0100 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> >> wrote: >> > --- a/libgomp/oacc-mem.c >> > +++ b/libgomp/oacc-mem.c >> >> > @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t >> > s, int async, const char *libfnname) >> >> > if (f & FLAG_COPYOUT) >> > [...] >> > gomp_copy_dev2host (acc_dev, aq, h, d, s); >> > } >> > - gomp_remove_var (acc_dev, n); >> > + gomp_remove_var_async (acc_dev, n, aq); >> >> Conceptually, I understand correctly that we need to use this (new) >> 'gomp_remove_var_async' to make sure that we don't >> 'gomp_free_device_memory' while the 'gomp_copy_dev2host' cited above >> is still in process? > > Yep. OK, so please prepare a patch changing just that, referencing PR92881: 's%gomp_remove_var%gomp_remove_var_async%' as cited above and also in 'libgomp/target.c:gomp_unmap_vars_internal' (for clarity, even though it doesn't matter in practice as that call will never 'gomp_free_device_memory'; see <http://mid.mail-archive.com/871rtg43me.fsf@euler.schwinge.homeip.net>), plus the addition of 'libgomp/target.c:gomp_remove_var_async' etc. >> I'm curious why this isn't causing any problems for nvptx offloading >> already, any thoughts on that? Or, is this just missing test >> coverage? (Always difficult for 'async' stuff, of course.) By >> chance, is this right now already causing problems with AMD GCN >> offloading? (I really need to set up AMD GCN offloading testing...) > > In a few cases, async stuff on nvidia seems to "just work" even in > cases where we wouldn't expect it to via inspection (either because the > driver/hardware is doing something "magic" Yeah, I too wondered whether there might be some such "magic" going on, to "help" users... > or because we're > somehow driving async operations in such a way that they run > synchronously in practice). Hope that's not that case. ;-) > One such case is with the "ephemeral" > asynchronous host-to-device memory copy patch. (Yeah, I still need to look into that.) > The AMD side seems much more sensitive to improper async behaviour -- > but I don't actually remember if I hit problems with this code in > particular. >> I'm citing below the changes introducing 'gomp_remove_var_async', >> modelled similar to the existing 'gomp_unmap_vars_async'. >> >> >> Also for both these, do I understand correctly, that it's actually not >> the 'gomp_unref_tgt' that needs to be "delayed" via >> 'goacc_asyncqueue', but rather really only the >> 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via >> 'gomp_unref_tgt'? In other words: why do we need to keep the 'struct >> target_mem_desc' alive? Per my understanding, that one is one >> component of the mapping table, and not relevant anymore (thus can be >> 'free'd) as soon as it has been determined that 'tgt->refcount == >> 0'? Am I missing something there? > > IIRC, that was Chung-Lin's choice. I'll CC him in. ;-) Or even mine; see 'gomp_unmap_vars_async' description and incremental patch in <https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01620.html>. > I think delaying > freeing of the target_mem_desc isn't really a huge problem, in practice. It certainly isn't a problem (only small bits of host memory "delayed"), but it still isn't the most clean design. Anyway: >> It will be OK to clean that up later >> but I'd like to understand this >> now. Well, or, stating that you just blindly copied that from the >> existing 'gomp_unmap_vars_async' is fine, too! ;-P > > Some changes arose via the porting to AMD GCN, and some may have been > drive-by fixes (e.g. where a synchronous call was used in a context > where it is obvious that an asynchronous call is really needed). Please, again, for sake of easy review, always do such changes separately from whatever else you're working on. This of course will add a bit of delay during your original development, but will make review and reasoning much, much easier -- at that time, and also when someone (yourself even?) needs to look up again something from the development history. > Like > you mentioned, test coverage could probably be better, and writing > reliable tests for async behaviour is challenging. Thus we need to invent something, eventually. Not testing stuff because it's challenging is not a good excuse for shipping un-tested code. Grüße Thomas
On 2019/12/10 12:04 AM, Julian Brown wrote: >> I'm citing below the changes introducing 'gomp_remove_var_async', >> modelled similar to the existing 'gomp_unmap_vars_async'. >> >> >> Also for both these, do I understand correctly, that it's actually not >> the 'gomp_unref_tgt' that needs to be "delayed" via >> 'goacc_asyncqueue', but rather really only the >> 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via >> 'gomp_unref_tgt'? In other words: why do we need to keep the 'struct >> target_mem_desc' alive? Per my understanding, that one is one >> component of the mapping table, and not relevant anymore (thus can be >> 'free'd) as soon as it has been determined that 'tgt->refcount == >> 0'? Am I missing something there? > IIRC, that was Chung-Lin's choice. I'll CC him in. I think delaying > freeing of the target_mem_desc isn't really a huge problem, in practice. I don't clearly remember all the details. It could be possible that not asyncqueue-ifying gomp_remove_var was simply an overlook. The 'target_mem_desc' is supposed to represent the piece of device memory inside libgomp, so unref/freeing it only after all dev-to-host copying is done seems logical. Chung-Lin
Hi! On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote: > On Mon, 21 Oct 2019 16:14:11 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> >> wrote: >> > void >> > -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, >> > int async, >> > - int finalize, int mapnum) >> > +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void >> > **hostaddrs, >> > + size_t *sizes, unsigned short *kinds, int >> > async, >> > + bool finalize, int mapnum) >> > { >> > [...] > That code's all gone with this version... \o/ Yay! >> > --- a/libgomp/oacc-parallel.c >> > +++ b/libgomp/oacc-parallel.c >> > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned >> > short *kinds) >> >> I've always been confused by this function (before known as >> 'find_pset'); this feels wrong, but I've never gotten to the bottom >> of it. > > This version removes that function in favour of a function that finds > groups of consecutive mappings that should be kept together for a > single gomp_map_vars invocation. I think that fits better with my > findings as written up on the wiki page > https://gcc.gnu.org/wiki/LibgompPointerMappingKinds. \o/ Yay! >> > [...] >> >> ;-) Yuck. As requested before: "Can we get a comment added to such >> 'magic', please?" > > That magic is gone now. \o/ Yay! >> I just wish that eventually we'll be able to can get rid of that >> stuff, and just let 'gomp_map_vars' do its thing. Similar to >> <https://gcc.gnu.org/PR90596> "'GOACC_parallel_keyed' should use >> 'GOMP_MAP_VARS_TARGET'". >> >> (For avoidance of doubt, that's not your task right now.) > I've removed the special-case handling > of pointers in the enter/exit data code, and combined the > gomp_acc_remove_pointer code (which now iterated over mappings > one-at-a-time anyway) with the loop iterating over mappings in the > new goacc_exit_data_internal function. It was a bit nonsensical to have > the "exit data" code split over two files, as before. Yes, I like that very much, and we shall tackle that next intermediate step once your patch for <https://gcc.gnu.org/PR92881> "[OpenACC] In async context, need to use 'gomp_remove_var_async' instead of 'gomp_remove_var'" is done, <http://mid.mail-archive.com/87tv681tb3.fsf@euler.schwinge.homeip.net>. One thing: > libgomp/ > * oacc-parallel.c (find_pointer): Remove function. > (find_group_last, goacc_enter_data_internal, > goacc_exit_data_internal): New functions. > (GOACC_enter_exit_data): Use goacc_enter_data_internal and > goacc_exit_data_internal helper functions. It makes much sense to move all that into 'libgomp/oacc-mem.c', and as a preparational step, see attached "[OpenACC] Consolidate 'GOACC_enter_exit_data' and its helper functions in 'libgomp/oacc-mem.c'", committed to trunk in r279233. Grüße Thomas
Hi Julian! On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote: > On Mon, 21 Oct 2019 16:14:11 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: > >> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> >> wrote: >> > --- a/libgomp/oacc-parallel.c >> > +++ b/libgomp/oacc-parallel.c >> > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned >> > short *kinds) >> >> I've always been confused by this function (before known as >> 'find_pset'); this feels wrong, but I've never gotten to the bottom >> of it. > > This version removes that function in favour of a function that finds > groups of consecutive mappings that should be kept together for a > single gomp_map_vars invocation. I think that fits better with my > findings as written up on the wiki page > https://gcc.gnu.org/wiki/LibgompPointerMappingKinds. :-) Please guide my trying to understand the changes there: > --- a/libgomp/oacc-parallel.c > +++ b/libgomp/oacc-parallel.c > @@ -47,23 +47,39 @@ _Static_assert (GOACC_FLAGS_UNMARSHAL (GOMP_DEVICE_HOST_FALLBACK) > "legacy GOMP_DEVICE_HOST_FALLBACK broken"); > > > -/* Returns the number of mappings associated with the pointer or pset. PSET > - have three mappings, whereas pointer have two. */ > +/* Some types of (pointer) variables use several consecutive mappings, which > + must be treated as a group for enter/exit data directives. This function > + returns the last mapping in such a group (inclusive), or POS for singleton > + mappings. */ > > static int > -find_pointer (int pos, size_t mapnum, unsigned short *kinds) > +find_group_last (int pos, size_t mapnum, unsigned short *kinds) > { > - if (pos + 1 >= mapnum) > - return 0; > + unsigned char kind0 = kinds[pos] & 0xff; > + int first_pos = pos, last_pos = pos; > > - unsigned char kind = kinds[pos+1] & 0xff; > - > - if (kind == GOMP_MAP_TO_PSET) > - return 3; > - else if (kind == GOMP_MAP_POINTER) > - return 2; > + if (kind0 == GOMP_MAP_TO_PSET) > + { > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER) > + last_pos = ++pos; > + /* We expect at least one GOMP_MAP_POINTER after a GOMP_MAP_TO_PSET. */ > + assert (last_pos > first_pos); > + } > + else > + { > + /* GOMP_MAP_ALWAYS_POINTER can only appear directly after some other > + mapping. */ > + if (pos + 1 < mapnum > + && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER) > + return pos + 1; > + > + /* We can have one or several GOMP_MAP_POINTER mappings after a to/from > + (etc.) mapping. */ > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER) > + last_pos = ++pos; > + } > > - return 0; > + return last_pos; > } So I ran a simple experiment where I did: assert (find_group_last (i, mapnum, kinds) == i + pointer); ... where 'pointer' is the current 'find_pointer' function. (That is, compare that the old and new way are doing the same things, given the current GCC code generation/test cases.) This 'assert' triggers for a few test cases: 'libgomp.oacc-fortran/allocatable-array-1.f90', 'libgomp.oacc-fortran/data-2.f90', 'libgomp.oacc-fortran/data-3.f90', 'libgomp.oacc-fortran/data-4-2.f90', 'libgomp.oacc-fortran/data-4.f90', 'libgomp.oacc-fortran/data-5.f90', 'libgomp.oacc-fortran/if-1.f90', 'libgomp.oacc-fortran/optional-data-enter-exit.f90'. (Maybe those are the only ones actually using that stuff?) I looked into the first one ('libgomp.oacc-fortran/allocatable-array-1.f90'), and for: integer, parameter :: n = 40 integer, allocatable :: ar(:,:,:) allocate (ar(1:n,0:n-1,0:n-1)) !$acc enter data copyin (ar) ... found: (gdb) print mapnum $2 = 3 (gdb) print kinds[0] $3 = 1 // GOMP_MAP_TO (gdb) print kinds[1] $4 = 773 (gdb) print kinds[1] & 0xff $5 = 5 // GOMP_MAP_TO_PSET (gdb) print kinds[2] $6 = 772 (gdb) print kinds[2] & 0xff $7 = 4 // GOMP_MAP_POINTER Current behavior: 'find_pointer (0, mapnum, kinds) == 3', so all three get mapped as one group. New behavior: 'find_group_last (0, mapnum, kinds) == 0', so the 'GOMP_MAP_TO' gets mapped alone. Then, 'find_group_last (1, mapnum, kinds) == 2', so the 'GOMP_MAP_TO_PSET', 'GOMP_MAP_POINTER' get mapped as one group. Is that intentional? Any then, compating that to 'libgomp/target.c:GOMP_target_enter_exit_data', where (aside from 'GOMP_MAP_STRUCT'; not relevant for us right now, yes?) everything always gets mapped alone: for (i = 0; i < mapnum; i++) if ((kinds[i] & 0xff) == GOMP_MAP_STRUCT) { [...] } else gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i], &kinds[i], true, GOMP_MAP_VARS_ENTER_DATA); Is it just an "accident" that for OpenACC we were and still are going to do this differently, or is there an actual reason? I'm not objecting to changing any of that, but would like to understand this better. Grüße Thomas
On Fri, 13 Dec 2019 16:25:25 +0100 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi Julian! > > On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> > wrote: > > static int > > -find_pointer (int pos, size_t mapnum, unsigned short *kinds) > > +find_group_last (int pos, size_t mapnum, unsigned short *kinds) > > { > > - if (pos + 1 >= mapnum) > > - return 0; > > + unsigned char kind0 = kinds[pos] & 0xff; > > + int first_pos = pos, last_pos = pos; > > > > - unsigned char kind = kinds[pos+1] & 0xff; > > - > > - if (kind == GOMP_MAP_TO_PSET) > > - return 3; > > - else if (kind == GOMP_MAP_POINTER) > > - return 2; > > + if (kind0 == GOMP_MAP_TO_PSET) > > + { > > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == > > GOMP_MAP_POINTER) > > + last_pos = ++pos; > > + /* We expect at least one GOMP_MAP_POINTER after a > > GOMP_MAP_TO_PSET. */ > > + assert (last_pos > first_pos); > > + } > > + else > > + { > > + /* GOMP_MAP_ALWAYS_POINTER can only appear directly after > > some other > > + mapping. */ > > + if (pos + 1 < mapnum > > + && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER) > > + return pos + 1; > > + > > + /* We can have one or several GOMP_MAP_POINTER mappings > > after a to/from > > + (etc.) mapping. */ > > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == > > GOMP_MAP_POINTER) > > + last_pos = ++pos; > > + } > > > > - return 0; > > + return last_pos; > > } > > So I ran a simple experiment where I did: > > assert (find_group_last (i, mapnum, kinds) == i + pointer); > > ... where 'pointer' is the current 'find_pointer' function. (That is, > compare that the old and new way are doing the same things, given the > current GCC code generation/test cases.) > > This 'assert' triggers for a few test cases: > 'libgomp.oacc-fortran/allocatable-array-1.f90', > 'libgomp.oacc-fortran/data-2.f90', 'libgomp.oacc-fortran/data-3.f90', > 'libgomp.oacc-fortran/data-4-2.f90', > 'libgomp.oacc-fortran/data-4.f90', 'libgomp.oacc-fortran/data-5.f90', > 'libgomp.oacc-fortran/if-1.f90', > 'libgomp.oacc-fortran/optional-data-enter-exit.f90'. (Maybe those > are the only ones actually using that stuff?) > > I looked into the first one > ('libgomp.oacc-fortran/allocatable-array-1.f90'), and for: > > integer, parameter :: n = 40 > integer, allocatable :: ar(:,:,:) > > allocate (ar(1:n,0:n-1,0:n-1)) > !$acc enter data copyin (ar) > > ... found: > > (gdb) print mapnum > $2 = 3 > (gdb) print kinds[0] > $3 = 1 // GOMP_MAP_TO > (gdb) print kinds[1] > $4 = 773 > (gdb) print kinds[1] & 0xff > $5 = 5 // GOMP_MAP_TO_PSET > (gdb) print kinds[2] > $6 = 772 > (gdb) print kinds[2] & 0xff > $7 = 4 // GOMP_MAP_POINTER > > Current behavior: 'find_pointer (0, mapnum, kinds) == 3', so all three > get mapped as one group. > > New behavior: 'find_group_last (0, mapnum, kinds) == 0', so the > 'GOMP_MAP_TO' gets mapped alone. Then, 'find_group_last (1, mapnum, > kinds) == 2', so the 'GOMP_MAP_TO_PSET', 'GOMP_MAP_POINTER' get > mapped as one group. > > Is that intentional? Yes. In a previous iteration of the refcount overhaul patch, we had the "magic" code fragment: > + for (int j = 0; j < 2; j++) > + gomp_map_vars_async (acc_dev, aq, > + (j == 0 || pointer == 2) ? 1 : 2, > + &hostaddrs[i + j], NULL, > + &sizes[i + j], &kinds[i + j], true, > + GOMP_MAP_VARS_OPENACC_ENTER_DATA); The "pointer == 3" case here will do precisely the same thing as the current iteration of the patch: pass the GOMP_MAP_TO to one gomp_map_vars_async call, and pass the GOMP_MAP_TO_PSET + GOMP_MAP_POINTER as a pair in a second call. The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a GOMP_MAP_POINTER) will also handle the mappings separately in both the earlier patch iteration and this one. That's different from the current behaviour, because we don't want all three mappings to be bound together. The problematic cases of doing so might only appear with the manual deep copy patch applied also, though (and/or with the refcount-checking patch applied/enabled). (I don't remember exactly which test cases this affected, but I can check.) The GOMP_MAP_TO_PSET plus the following GOMP_MAP_POINTER mappings are treated as a group within gomp_map_vars_internal. So I'm not sure... > Any then, compating that to > 'libgomp/target.c:GOMP_target_enter_exit_data', where (aside from > 'GOMP_MAP_STRUCT'; not relevant for us right now, yes?) everything > always gets mapped alone: > > for (i = 0; i < mapnum; i++) > if ((kinds[i] & 0xff) == GOMP_MAP_STRUCT) > { [...] } > else > gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i], > &kinds[i], true, GOMP_MAP_VARS_ENTER_DATA); > > Is it just an "accident" that for OpenACC we were and still are going > to do this differently, or is there an actual reason? ...why mapping one-at-a-time is the right thing to do here. Maybe the OpenMP version never sees GOMP_MAP_TO_PSET (or GOMP_MAP_ALWAYS_POINTER, which has a hard-wired dependency on the previous clause)? (I can try to check that too.) Thanks, Julian
On Sat, 14 Dec 2019 00:19:04 +0000 Julian Brown <julian@codesourcery.com> wrote: > On Fri, 13 Dec 2019 16:25:25 +0100 > Thomas Schwinge <thomas@codesourcery.com> wrote: > > > Hi Julian! > > > > On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> > > wrote: > > > static int > > > -find_pointer (int pos, size_t mapnum, unsigned short *kinds) > > > +find_group_last (int pos, size_t mapnum, unsigned short *kinds) > > > { > > > - if (pos + 1 >= mapnum) > > > - return 0; > > > + unsigned char kind0 = kinds[pos] & 0xff; > > > + int first_pos = pos, last_pos = pos; > > > > > > - unsigned char kind = kinds[pos+1] & 0xff; > > > - > > > - if (kind == GOMP_MAP_TO_PSET) > > > - return 3; > > > - else if (kind == GOMP_MAP_POINTER) > > > - return 2; > > > + if (kind0 == GOMP_MAP_TO_PSET) > > > + { > > > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == > > > GOMP_MAP_POINTER) > > > + last_pos = ++pos; > > > + /* We expect at least one GOMP_MAP_POINTER after a > > > GOMP_MAP_TO_PSET. */ > > > + assert (last_pos > first_pos); > > > + } > > > + else > > > + { > > > + /* GOMP_MAP_ALWAYS_POINTER can only appear directly after > > > some other > > > + mapping. */ > > > + if (pos + 1 < mapnum > > > + && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER) > > > + return pos + 1; > > > + > > > + /* We can have one or several GOMP_MAP_POINTER mappings > > > after a to/from > > > + (etc.) mapping. */ > > > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == > > > GOMP_MAP_POINTER) > > > + last_pos = ++pos; > > > + } > > > > > > - return 0; > > > + return last_pos; > > > } > > > > So I ran a simple experiment where I did: > > > > assert (find_group_last (i, mapnum, kinds) == i + pointer); > > > > ... where 'pointer' is the current 'find_pointer' function. (That > > is, compare that the old and new way are doing the same things, > > given the current GCC code generation/test cases.) > > > > This 'assert' triggers for a few test cases: > > 'libgomp.oacc-fortran/allocatable-array-1.f90', > > 'libgomp.oacc-fortran/data-2.f90', > > 'libgomp.oacc-fortran/data-3.f90', > > 'libgomp.oacc-fortran/data-4-2.f90', > > 'libgomp.oacc-fortran/data-4.f90', > > 'libgomp.oacc-fortran/data-5.f90', 'libgomp.oacc-fortran/if-1.f90', > > 'libgomp.oacc-fortran/optional-data-enter-exit.f90'. (Maybe those > > are the only ones actually using that stuff?) > > > > I looked into the first one > > ('libgomp.oacc-fortran/allocatable-array-1.f90'), and for: > > > > integer, parameter :: n = 40 > > integer, allocatable :: ar(:,:,:) > > > > allocate (ar(1:n,0:n-1,0:n-1)) > > !$acc enter data copyin (ar) > > > > ... found: > > > > (gdb) print mapnum > > $2 = 3 > > (gdb) print kinds[0] > > $3 = 1 // GOMP_MAP_TO > > (gdb) print kinds[1] > > $4 = 773 > > (gdb) print kinds[1] & 0xff > > $5 = 5 // GOMP_MAP_TO_PSET > > (gdb) print kinds[2] > > $6 = 772 > > (gdb) print kinds[2] & 0xff > > $7 = 4 // GOMP_MAP_POINTER > > > > Current behavior: 'find_pointer (0, mapnum, kinds) == 3', so all > > three get mapped as one group. > > > > New behavior: 'find_group_last (0, mapnum, kinds) == 0', so the > > 'GOMP_MAP_TO' gets mapped alone. Then, 'find_group_last (1, mapnum, > > kinds) == 2', so the 'GOMP_MAP_TO_PSET', 'GOMP_MAP_POINTER' get > > mapped as one group. > > > > Is that intentional? > > Yes. In a previous iteration of the refcount overhaul patch, we had > the "magic" code fragment: > > > + for (int j = 0; j < 2; j++) > > + gomp_map_vars_async (acc_dev, aq, > > + (j == 0 || pointer == 2) ? > > 1 : 2, > > + &hostaddrs[i + j], NULL, > > + &sizes[i + j], &kinds[i + j], > > true, > > + > > GOMP_MAP_VARS_OPENACC_ENTER_DATA); > > The "pointer == 3" case here will do precisely the same thing as the > current iteration of the patch: pass the GOMP_MAP_TO to one > gomp_map_vars_async call, and pass the GOMP_MAP_TO_PSET + > GOMP_MAP_POINTER as a pair in a second call. > > The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a > GOMP_MAP_POINTER) will also handle the mappings separately in both the > earlier patch iteration and this one. > > That's different from the current behaviour, because we don't want all > three mappings to be bound together. The problematic cases of doing > so might only appear with the manual deep copy patch applied also, > though (and/or with the refcount-checking patch applied/enabled). (I > don't remember exactly which test cases this affected, but I can > check.) To follow up from this: the change in this patch is really to ensure that reference counts are correct/consistent for *all* mappings at all times. Contrast the behaviour described in the following comment in the existing code (goacc_insert_pointer): /* ... Only the first mapping is considered in reference counting; the following ones implicitly follow suit. */ This is problematic with automated checking since the "hidden" mappings will have incorrect counts, and the problem becomes worse when the GOMP_MAP_ATTACH, etc. mappings are added by the manual deep copy patch. I tweaked the patch together with some debug-dumping code, and the change from "find_pointer-like" behaviour and "find_group_last-like" behaviour can be seen as follows (from deep-copy-8.c): with find_pointer: mapping group 0-4 0 : gomp_map_struct 0x7ffd5aa10ce0 4 1 : gomp_map_to 0x7ffd5aa10ce0 4 2 : gomp_map_alloc 0x7ffd5aa10ce8 8 3 : gomp_map_alloc 0x7ffd5aa10cf0 8 4 : gomp_map_alloc 0x7ffd5aa10cf8 8 mapping group 5-6 0 : gomp_map_to 0x14ee050 400 1 : gomp_map_attach 0x7ffd5aa10ce8 0 mapping group 7-8 0 : gomp_map_to 0x14ee1f0 400 1 : gomp_map_attach 0x7ffd5aa10cf0 0 mapping group 9-10 0 : gomp_map_to 0x14ee390 400 1 : gomp_map_attach 0x7ffd5aa10cf8 0 with find_group_last: mapping group 0-4 0 : gomp_map_struct 0x7ffc9011c3b0 4 1 : gomp_map_to 0x7ffc9011c3b0 4 2 : gomp_map_alloc 0x7ffc9011c3b8 8 3 : gomp_map_alloc 0x7ffc9011c3c0 8 4 : gomp_map_alloc 0x7ffc9011c3c8 8 mapping group 5-5 0 : gomp_map_to 0x10e0050 400 mapping group 6-6 0 : gomp_map_attach 0x7ffc9011c3b8 0 mapping group 7-7 0 : gomp_map_to 0x10e01f0 400 mapping group 8-8 0 : gomp_map_attach 0x7ffc9011c3c0 0 mapping group 9-9 0 : gomp_map_to 0x10e0390 400 mapping group 10-10 0 : gomp_map_attach 0x7ffc9011c3c8 0 In the former case, each grouped "gomp_map_to/gomp_map_attach" will form a single target_mem_desc. Then, goacc_exit_data_internal (or the previous code it replaces) performs unmapping one mapping (splay tree key) at a time. If any of these splay trees reference count hits zero, gomp_remove_var_async will be called, and then (I think) that grouping-together becomes problematic: the reference for the "other" splay tree key in the target_mem_desc's list gets lost. > > Any then, compating that to > > 'libgomp/target.c:GOMP_target_enter_exit_data', where (aside from > > 'GOMP_MAP_STRUCT'; not relevant for us right now, yes?) everything > > always gets mapped alone: > > > > for (i = 0; i < mapnum; i++) > > if ((kinds[i] & 0xff) == GOMP_MAP_STRUCT) > > { [...] } > > else > > gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i], > > &kinds[i], true, GOMP_MAP_VARS_ENTER_DATA); > > > > Is it just an "accident" that for OpenACC we were and still are > > going to do this differently, or is there an actual reason? > > ...why mapping one-at-a-time is the right thing to do here. Maybe the > OpenMP version never sees GOMP_MAP_TO_PSET (or > GOMP_MAP_ALWAYS_POINTER, which has a hard-wired dependency on the > previous clause)? (I can try to check that too.) Actually it looks like GOMP_MAP_TO_PSET can occur in GOMP_target_enter_exit_data, but it seems that only a single test case exercises that (libgomp.fortran/target9.f90). I'd guess probably either way works -- either with GOMP_MAP_POINTER grouped together after a related GOMP_MAP_TO_PSET, or not. Thanks, Julian
Hi Julian! Thanks for walking me through this. On 2019-12-14T00:19:04+0000, Julian Brown <julian@codesourcery.com> wrote: > On Fri, 13 Dec 2019 16:25:25 +0100 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> >> wrote: >> > static int >> > -find_pointer (int pos, size_t mapnum, unsigned short *kinds) >> > +find_group_last (int pos, size_t mapnum, unsigned short *kinds) >> > { >> > - if (pos + 1 >= mapnum) >> > - return 0; >> > + unsigned char kind0 = kinds[pos] & 0xff; >> > + int first_pos = pos, last_pos = pos; >> > >> > - unsigned char kind = kinds[pos+1] & 0xff; >> > - >> > - if (kind == GOMP_MAP_TO_PSET) >> > - return 3; >> > - else if (kind == GOMP_MAP_POINTER) >> > - return 2; >> > + if (kind0 == GOMP_MAP_TO_PSET) >> > + { >> > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER) >> > + last_pos = ++pos; >> > + /* We expect at least one GOMP_MAP_POINTER after a GOMP_MAP_TO_PSET. */ >> > + assert (last_pos > first_pos); >> > + } >> > + else >> > + { >> > + /* GOMP_MAP_ALWAYS_POINTER can only appear directly after some other >> > + mapping. */ >> > + if (pos + 1 < mapnum >> > + && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER) >> > + return pos + 1; >> > + >> > + /* We can have one or several GOMP_MAP_POINTER mappings after a to/from >> > + (etc.) mapping. */ >> > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER) >> > + last_pos = ++pos; >> > + } >> > >> > - return 0; >> > + return last_pos; >> > } Given: program test implicit none integer, parameter :: n = 64 integer :: a(n) call test_array(a) contains subroutine test_array(a) implicit none integer :: a(n) !$acc enter data copyin(a) !$acc exit data delete(a) end subroutine test_array end program test ..., we get a 'GOMP_MAP_TO' followed by a 'GOMP_MAP_POINTER'. That got us 'find_pointer () == 2', and now we get 'find_group_last (i) == i + 1' (so, the same). > In a previous iteration of the refcount overhaul patch, we had the > "magic" code fragment: > >> + for (int j = 0; j < 2; j++) >> + gomp_map_vars_async (acc_dev, aq, >> + (j == 0 || pointer == 2) ? 1 : 2, >> + &hostaddrs[i + j], NULL, >> + &sizes[i + j], &kinds[i + j], true, >> + GOMP_MAP_VARS_OPENACC_ENTER_DATA); > The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a > GOMP_MAP_POINTER) So, that's the example given above. > will also handle the mappings separately in both the > earlier patch iteration ACK, given the "previous iteration" code presented above. > and this one. NACK? Given 'find_group_last (i) == i + 1', that means that 'GOMP_MAP_TO' and 'GOMP_MAP_POINTER' get mapped as one group? On the other hand, it still does match the current 'find_pointer' behavior? But what should the behavior here be: 'GOMP_MAP_TO', 'GOMP_MAP_POINTER' each separate, or as one group? Confusing stuff. :-| Grüße Thomas
On Wed, 18 Dec 2019 10:18:14 +0100 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi Julian! > > Thanks for walking me through this. > > On 2019-12-14T00:19:04+0000, Julian Brown <julian@codesourcery.com> > wrote: > > On Fri, 13 Dec 2019 16:25:25 +0100 > > Thomas Schwinge <thomas@codesourcery.com> wrote: > >> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> > >> wrote: > >> > static int > >> > -find_pointer (int pos, size_t mapnum, unsigned short *kinds) > >> > +find_group_last (int pos, size_t mapnum, unsigned short *kinds) > >> > { > >> > - if (pos + 1 >= mapnum) > >> > - return 0; > >> > + unsigned char kind0 = kinds[pos] & 0xff; > >> > + int first_pos = pos, last_pos = pos; > >> > > >> > - unsigned char kind = kinds[pos+1] & 0xff; > >> > - > >> > - if (kind == GOMP_MAP_TO_PSET) > >> > - return 3; > >> > - else if (kind == GOMP_MAP_POINTER) > >> > - return 2; > >> > + if (kind0 == GOMP_MAP_TO_PSET) > >> > + { > >> > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == > >> > GOMP_MAP_POINTER) > >> > + last_pos = ++pos; > >> > + /* We expect at least one GOMP_MAP_POINTER after a > >> > GOMP_MAP_TO_PSET. */ > >> > + assert (last_pos > first_pos); > >> > + } > >> > + else > >> > + { > >> > + /* GOMP_MAP_ALWAYS_POINTER can only appear directly after > >> > some other > >> > + mapping. */ > >> > + if (pos + 1 < mapnum > >> > + && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER) > >> > + return pos + 1; > >> > + > >> > + /* We can have one or several GOMP_MAP_POINTER mappings > >> > after a to/from > >> > + (etc.) mapping. */ > >> > + while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == > >> > GOMP_MAP_POINTER) > >> > + last_pos = ++pos; > >> > + } > >> > > >> > - return 0; > >> > + return last_pos; > >> > } > > Given: > > program test > implicit none > > integer, parameter :: n = 64 > integer :: a(n) > > call test_array(a) > > contains > subroutine test_array(a) > implicit none > > integer :: a(n) > > !$acc enter data copyin(a) > > !$acc exit data delete(a) > end subroutine test_array > end program test > > ..., we get a 'GOMP_MAP_TO' followed by a 'GOMP_MAP_POINTER'. That > got us 'find_pointer () == 2', and now we get 'find_group_last (i) == > i + 1' (so, the same). > > > In a previous iteration of the refcount overhaul patch, we had the > > "magic" code fragment: > > > >> + for (int j = 0; j < 2; j++) > >> + gomp_map_vars_async (acc_dev, aq, > >> + (j == 0 || pointer == 2) ? > >> 1 : 2, > >> + &hostaddrs[i + j], NULL, > >> + &sizes[i + j], &kinds[i + > >> j], true, > >> + > >> GOMP_MAP_VARS_OPENACC_ENTER_DATA); > > > The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a > > GOMP_MAP_POINTER) > > So, that's the example given above. > > > will also handle the mappings separately in both the > > earlier patch iteration > > ACK, given the "previous iteration" code presented above. > > > and this one. > > NACK? Given 'find_group_last (i) == i + 1', that means that > 'GOMP_MAP_TO' and 'GOMP_MAP_POINTER' get mapped as one group? > > On the other hand, it still does match the current 'find_pointer' > behavior? > > But what should the behavior here be: 'GOMP_MAP_TO', > 'GOMP_MAP_POINTER' each separate, or as one group? > > Confusing stuff. :-| Hmm. I think that GOMP_MAP_POINTER is only intended to be used after some other mapping (TO/TOFROM/TO_PSET/etc.). In the follow-up patch supporting deep copy, this code is extended and refactored a little more: https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01256.html One of the changes made there is to disallow GOMP_MAP{,_ALWAYS}_POINTER from appearing by itself. By my reading, that must be the case for GOMP_MAP_ALWAYS_POINTER because it has a hard-wired dependency on the previous mapping. GOMP_MAP_POINTER is slightly more questionable: at least according to the comment in gomp-constants.h, these are "an internal only map kind, used for pointer based array sections" -- so it's a little surprising they now reach the libgomp runtime at all. Maybe it was a mistake? The GOMP_MAP_ATTACH mapping (as in the example upthread) is different -- that one *can* appear by itself. Perhaps the difference (wrt. reference counting here) is that GOMP_MAP_POINTER refers to the same target_mem_desc as the previous (grouped-together) mapping, but GOMP_MAP_ATTACH does not (rather, referring to the location of the *pointer* to the data of a previous mapping, rather than the data itself). For GOMP_MAP_TO_PSET, a subsequent GOMP_MAP_POINTER will refer to the pointer set itself. So, same thing, and it's not problematic to group the mappings together. Anyway: thinking about it some more, I don't think any of the ways these types of mappings get grouped together should really be causing refcount-checking failures, so maybe something's wrong (at least academically) in goacc_exit_data_internal. The "real" problem with parasitical groupings is if we have multiple "enter data" mappings that get bound together in a single target_mem_desc, and are unmapped at different times: #pragma acc enter data copyin(arr1) copyin(arr2) ... #pragma acc exit data copyout(arr1) #pragma acc exit data copyout(arr2) That's clearly not what's happening here though. I will investigate further. Thanks, Julian
Hi! On 2019-12-11T18:22:00+0100, I wrote: > On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote: >> I've removed the special-case handling >> of pointers in the enter/exit data code, and combined the >> gomp_acc_remove_pointer code (which now iterated over mappings >> one-at-a-time anyway) with the loop iterating over mappings in the >> new goacc_exit_data_internal function. It was a bit nonsensical to have >> the "exit data" code split over two files, as before. > > Yes, I like that very much, and we shall tackle that next intermediate > step > One thing: > >> libgomp/ > >> * oacc-parallel.c (find_pointer): Remove function. >> (find_group_last, goacc_enter_data_internal, >> goacc_exit_data_internal): New functions. >> (GOACC_enter_exit_data): Use goacc_enter_data_internal and >> goacc_exit_data_internal helper functions. > > It makes much sense to move all that into 'libgomp/oacc-mem.c', and as a > preparational step, see attached "[OpenACC] Consolidate > 'GOACC_enter_exit_data' and its helper functions in > 'libgomp/oacc-mem.c'", committed to trunk in r279233. Working incrementally towards the goal of unifying all that mapping handling code, I did some refactoring ("No functional changes"): see the attached "[OpenACC] Refactor 'present_create_copy' into 'goacc_enter_data'", "[OpenACC] Refactor 'delete_copyout' into 'goacc_exit_data'", "[OpenACC] Refactor 'GOACC_enter_exit_data' to call 'goacc_enter_data', 'goacc_exit_data'", "[OpenACC] Refactor 'goacc_remove_pointer' interface", "[OpenACC] Refactor 'goacc_enter_data' so that it can be called from 'goacc_insert_pointer', "not present" case", "[OpenACC] Refactor 'goacc_enter_data' so that it can be called from 'goacc_insert_pointer', "present" case, and simplify"; committed to trunk in r279535, r279536, r279537, r279538, r279539, r279540. Grüße Thomas
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 178eb600ccd..6b7ed7248a1 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -908,6 +908,10 @@ struct target_mem_desc { #define OFFSET_POINTER (~(uintptr_t) 1) #define OFFSET_STRUCT (~(uintptr_t) 2) +/* A special tag value for "virtual_refcount" in the splay_tree_key_s structure + below. */ +#define VREFCOUNT_LINK_KEY (~(uintptr_t) 0) + struct splay_tree_key_s { /* Address of the host object. */ uintptr_t host_start; @@ -919,10 +923,18 @@ struct splay_tree_key_s { uintptr_t tgt_offset; /* Reference count. */ uintptr_t refcount; - /* Dynamic reference count. */ - uintptr_t dynamic_refcount; - /* Pointer to the original mapping of "omp declare target link" object. */ - splay_tree_key link_key; + /* Reference counts beyond those that represent genuine references in the + linked splay tree key/target memory structures, e.g. for multiple OpenACC + "present increment" operations (via "acc enter data") referring to the same + host-memory block. + If set to VREFCOUNT_LINK_KEY (for OpenMP, where this field is not otherwise + needed), the union below represents a link key. */ + uintptr_t virtual_refcount; + union { + /* Pointer to the original mapping of "omp declare target link" object. + Only used for OpenMP. */ + splay_tree_key link_key; + } u; }; /* The comparison function. */ @@ -944,13 +956,6 @@ splay_compare (splay_tree_key x, splay_tree_key y) typedef struct acc_dispatch_t { - /* This is a linked list of data mapped using the - acc_map_data/acc_unmap_data or "acc enter data"/"acc exit data" pragmas. - Unlike mapped_data in the goacc_thread struct, unmapping can - happen out-of-order with respect to mapping. */ - /* This is guarded by the lock in the "outer" struct gomp_device_descr. */ - struct target_mem_desc *data_environ; - /* Execute. */ __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func; @@ -1060,13 +1065,15 @@ struct gomp_device_descr enum gomp_map_vars_kind { GOMP_MAP_VARS_OPENACC, + GOMP_MAP_VARS_OPENACC_ENTER_DATA, GOMP_MAP_VARS_TARGET, GOMP_MAP_VARS_DATA, GOMP_MAP_VARS_ENTER_DATA }; -extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int); -extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int); +extern void gomp_acc_remove_pointer (struct gomp_device_descr *, void **, + size_t *, unsigned short *, int, bool, + int); extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *, unsigned short *); struct gomp_coalesce_buf; @@ -1092,9 +1099,10 @@ extern void gomp_unmap_vars_async (struct target_mem_desc *, bool, struct goacc_asyncqueue *); extern void gomp_init_device (struct gomp_device_descr *); extern bool gomp_fini_device (struct gomp_device_descr *); -extern void gomp_free_memmap (struct splay_tree_s *); extern void gomp_unload_device (struct gomp_device_descr *); extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key); +extern void gomp_remove_var_async (struct gomp_device_descr *, splay_tree_key, + struct goacc_asyncqueue *); /* work.c */ diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c index 12299aee65d..1b9adcec774 100644 --- a/libgomp/oacc-host.c +++ b/libgomp/oacc-host.c @@ -264,8 +264,6 @@ static struct gomp_device_descr host_dispatch = .state = GOMP_DEVICE_UNINITIALIZED, .openacc = { - .data_environ = NULL, - .exec_func = host_openacc_exec, .create_thread_data_func = host_openacc_create_thread_data, diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c index e1568c535b3..e0395ef43b2 100644 --- a/libgomp/oacc-init.c +++ b/libgomp/oacc-init.c @@ -356,9 +356,13 @@ acc_shutdown_1 (acc_device_t d) if (walk->dev) { - gomp_mutex_lock (&walk->dev->lock); - gomp_free_memmap (&walk->dev->mem_map); - gomp_mutex_unlock (&walk->dev->lock); + while (walk->dev->mem_map.root) + { + splay_tree_key k = &walk->dev->mem_map.root->key; + if (k->virtual_refcount == VREFCOUNT_LINK_KEY) + k->u.link_key = NULL; + gomp_remove_var (walk->dev, k); + } walk->dev = NULL; walk->base_dev = NULL; diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c index 2f271009fb8..25084b71a2d 100644 --- a/libgomp/oacc-mem.c +++ b/libgomp/oacc-mem.c @@ -50,6 +50,25 @@ lookup_host (struct gomp_device_descr *dev, void *h, size_t s) return key; } +/* Helper for lookup_dev. Iterate over splay tree. */ + +static splay_tree_key +lookup_dev_1 (splay_tree_node node, uintptr_t d, size_t s) +{ + splay_tree_key k = &node->key; + struct target_mem_desc *t = k->tgt; + + if (d >= t->tgt_start && d + s <= t->tgt_end) + return k; + + if (node->left) + return lookup_dev_1 (node->left, d, s); + if (node->right) + return lookup_dev_1 (node->right, d, s); + + return NULL; +} + /* Return block containing [D->S), or NULL if not contained. The list isn't ordered by device address, so we have to iterate over the whole array. This is not expected to be a common @@ -57,35 +76,12 @@ lookup_host (struct gomp_device_descr *dev, void *h, size_t s) remains locked on exit. */ static splay_tree_key -lookup_dev (struct target_mem_desc *tgt, void *d, size_t s) +lookup_dev (splay_tree mem_map, void *d, size_t s) { - int i; - struct target_mem_desc *t; - - if (!tgt) + if (!mem_map || !mem_map->root) return NULL; - for (t = tgt; t != NULL; t = t->prev) - { - if (t->tgt_start <= (uintptr_t) d && t->tgt_end >= (uintptr_t) d + s) - break; - } - - if (!t) - return NULL; - - for (i = 0; i < t->list_count; i++) - { - void * offset; - - splay_tree_key k = &t->array[i].key; - offset = d - t->tgt_start + k->tgt_offset; - - if (k->host_start + offset <= (void *) k->host_end) - return k; - } - - return NULL; + return lookup_dev_1 (mem_map->root, (uintptr_t) d, s); } /* OpenACC is silent on how memory exhaustion is indicated. We return @@ -150,7 +146,7 @@ acc_free (void *d) /* We don't have to call lazy open here, as the ptr value must have been returned by acc_malloc. It's not permitted to pass NULL in (unless you got that null from acc_malloc). */ - if ((k = lookup_dev (acc_dev->openacc.data_environ, d, 1))) + if ((k = lookup_dev (&acc_dev->mem_map, d, 1))) { void *offset; @@ -301,7 +297,7 @@ acc_hostptr (void *d) gomp_mutex_lock (&acc_dev->lock); - n = lookup_dev (acc_dev->openacc.data_environ, d, 1); + n = lookup_dev (&acc_dev->mem_map, d, 1); if (!n) { @@ -396,7 +392,7 @@ acc_map_data (void *h, void *d, size_t s) (int)s); } - if (lookup_dev (thr->dev->openacc.data_environ, d, s)) + if (lookup_dev (&thr->dev->mem_map, d, s)) { gomp_mutex_unlock (&acc_dev->lock); gomp_fatal ("device address [%p, +%d] is already mapped", (void *)d, @@ -415,11 +411,6 @@ acc_map_data (void *h, void *d, size_t s) thr->api_info = NULL; } } - - gomp_mutex_lock (&acc_dev->lock); - tgt->prev = acc_dev->openacc.data_environ; - acc_dev->openacc.data_environ = tgt; - gomp_mutex_unlock (&acc_dev->lock); } void @@ -427,6 +418,7 @@ acc_unmap_data (void *h) { struct goacc_thread *thr = goacc_thread (); struct gomp_device_descr *acc_dev = thr->dev; + struct splay_tree_key_s cur_node; /* No need to call lazy open, as the address must have been mapped. */ @@ -438,12 +430,11 @@ acc_unmap_data (void *h) acc_api_info api_info; bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info); - size_t host_size; - gomp_mutex_lock (&acc_dev->lock); - splay_tree_key n = lookup_host (acc_dev, h, 1); - struct target_mem_desc *t; + cur_node.host_start = (uintptr_t) h; + cur_node.host_end = cur_node.host_start + 1; + splay_tree_key n = splay_tree_lookup (&acc_dev->mem_map, &cur_node); if (!n) { @@ -451,47 +442,28 @@ acc_unmap_data (void *h) gomp_fatal ("%p is not a mapped block", (void *)h); } - host_size = n->host_end - n->host_start; - if (n->host_start != (uintptr_t) h) { + size_t host_size = n->host_end - n->host_start; gomp_mutex_unlock (&acc_dev->lock); gomp_fatal ("[%p,%d] surrounds %p", (void *) n->host_start, (int) host_size, (void *) h); } - /* Mark for removal. */ - n->refcount = 1; + splay_tree_remove (&acc_dev->mem_map, n); - t = n->tgt; + struct target_mem_desc *tgt = n->tgt; - if (t->refcount == 2) + if (tgt->refcount > 0) + tgt->refcount--; + else { - struct target_mem_desc *tp; - - /* This is the last reference, so pull the descriptor off the - chain. This avoids gomp_unmap_vars via gomp_unmap_tgt from - freeing the device memory. */ - t->tgt_end = 0; - t->to_free = 0; - - for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL; - tp = t, t = t->prev) - if (n->tgt == t) - { - if (tp) - tp->prev = t->prev; - else - acc_dev->openacc.data_environ = t->prev; - - break; - } + free (tgt->array); + free (tgt); } gomp_mutex_unlock (&acc_dev->lock); - gomp_unmap_vars (t, true); - if (profiling_p) { thr->prof_info = NULL; @@ -549,11 +521,14 @@ present_create_copy (unsigned f, void *h, size_t s, int async) gomp_fatal ("[%p,+%d] not mapped", (void *)h, (int)s); } + assert (n->virtual_refcount != VREFCOUNT_LINK_KEY); + if (n->refcount != REFCOUNT_INFINITY) { n->refcount++; - n->dynamic_refcount++; + n->virtual_refcount++; } + gomp_mutex_unlock (&acc_dev->lock); } else if (!(f & FLAG_CREATE)) @@ -563,7 +538,6 @@ present_create_copy (unsigned f, void *h, size_t s, int async) } else { - struct target_mem_desc *tgt; size_t mapnum = 1; unsigned short kinds; void *hostaddrs = h; @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, size_t s, int async) goacc_aq aq = get_goacc_asyncqueue (async); - tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s, - &kinds, true, GOMP_MAP_VARS_OPENACC); - /* Initialize dynamic refcount. */ - tgt->list[0].key->dynamic_refcount = 1; + gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s, &kinds, + true, GOMP_MAP_VARS_OPENACC_ENTER_DATA); gomp_mutex_lock (&acc_dev->lock); - - d = tgt->to_free; - tgt->prev = acc_dev->openacc.data_environ; - acc_dev->openacc.data_environ = tgt; - + n = lookup_host (acc_dev, h, s); + assert (n != NULL); + d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h + - n->host_start); gomp_mutex_unlock (&acc_dev->lock); } @@ -671,7 +642,6 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname) { size_t host_size; splay_tree_key n; - void *d; struct goacc_thread *thr = goacc_thread (); struct gomp_device_descr *acc_dev = thr->dev; @@ -700,8 +670,7 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname) gomp_fatal ("[%p,%d] is not mapped", (void *)h, (int)s); } - d = (void *) (n->tgt->tgt_start + n->tgt_offset - + (uintptr_t) h - n->host_start); + assert (n->virtual_refcount != VREFCOUNT_LINK_KEY); host_size = n->host_end - n->host_start; @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname) if (n->refcount == REFCOUNT_INFINITY) { n->refcount = 0; - n->dynamic_refcount = 0; - } - if (n->refcount < n->dynamic_refcount) - { - gomp_mutex_unlock (&acc_dev->lock); - gomp_fatal ("Dynamic reference counting assert fail\n"); + n->virtual_refcount = 0; } if (f & FLAG_FINALIZE) { - n->refcount -= n->dynamic_refcount; - n->dynamic_refcount = 0; + n->refcount -= n->virtual_refcount; + n->virtual_refcount = 0; } - else if (n->dynamic_refcount) + + if (n->virtual_refcount > 0) { - n->dynamic_refcount--; n->refcount--; + n->virtual_refcount--; } + else if (n->refcount > 0) + n->refcount--; if (n->refcount == 0) { - if (n->tgt->refcount == 2) - { - struct target_mem_desc *tp, *t; - for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL; - tp = t, t = t->prev) - if (n->tgt == t) - { - if (tp) - tp->prev = t->prev; - else - acc_dev->openacc.data_environ = t->prev; - break; - } - } + goacc_aq aq = get_goacc_asyncqueue (async); if (f & FLAG_COPYOUT) - { - goacc_aq aq = get_goacc_asyncqueue (async); + { + void *d = (void *) (n->tgt->tgt_start + n->tgt_offset + + (uintptr_t) h - n->host_start); gomp_copy_dev2host (acc_dev, aq, h, d, s); } - gomp_remove_var (acc_dev, n); + gomp_remove_var_async (acc_dev, n, aq); } gomp_mutex_unlock (&acc_dev->lock); @@ -894,140 +849,80 @@ acc_update_self_async (void *h, size_t s, int async) } void -gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes, - void *kinds, int async) -{ - struct target_mem_desc *tgt; - struct goacc_thread *thr = goacc_thread (); - struct gomp_device_descr *acc_dev = thr->dev; - - if (acc_is_present (*hostaddrs, *sizes)) - { - splay_tree_key n; - gomp_mutex_lock (&acc_dev->lock); - n = lookup_host (acc_dev, *hostaddrs, *sizes); - gomp_mutex_unlock (&acc_dev->lock); - - tgt = n->tgt; - for (size_t i = 0; i < tgt->list_count; i++) - if (tgt->list[i].key == n) - { - for (size_t j = 0; j < mapnum; j++) - if (i + j < tgt->list_count && tgt->list[i + j].key) - { - tgt->list[i + j].key->refcount++; - tgt->list[i + j].key->dynamic_refcount++; - } - return; - } - /* Should not reach here. */ - gomp_fatal ("Dynamic refcount incrementing failed for pointer/pset"); - } - - gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__); - goacc_aq aq = get_goacc_asyncqueue (async); - tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, - NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC); - gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__); - - /* Initialize dynamic refcount. */ - tgt->list[0].key->dynamic_refcount = 1; - - gomp_mutex_lock (&acc_dev->lock); - tgt->prev = acc_dev->openacc.data_environ; - acc_dev->openacc.data_environ = tgt; - gomp_mutex_unlock (&acc_dev->lock); -} - -void -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async, - int finalize, int mapnum) +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void **hostaddrs, + size_t *sizes, unsigned short *kinds, int async, + bool finalize, int mapnum) { - struct goacc_thread *thr = goacc_thread (); - struct gomp_device_descr *acc_dev = thr->dev; + struct splay_tree_key_s cur_node; splay_tree_key n; - struct target_mem_desc *t; - int minrefs = (mapnum == 1) ? 2 : 3; - - if (!acc_is_present (h, s)) - return; gomp_mutex_lock (&acc_dev->lock); - n = lookup_host (acc_dev, h, 1); - - if (!n) - { - gomp_mutex_unlock (&acc_dev->lock); - gomp_fatal ("%p is not a mapped block", (void *)h); - } - - gomp_debug (0, " %s: restore mappings\n", __FUNCTION__); - - t = n->tgt; - - if (n->refcount < n->dynamic_refcount) + for (int i = 0; i < mapnum; i++) { - gomp_mutex_unlock (&acc_dev->lock); - gomp_fatal ("Dynamic reference counting assert fail\n"); - } + int kind = kinds[i] & 0xff; + bool copyfrom = false; - if (finalize) - { - n->refcount -= n->dynamic_refcount; - n->dynamic_refcount = 0; - } - else if (n->dynamic_refcount) - { - n->dynamic_refcount--; - n->refcount--; - } + switch (kind) + { + case GOMP_MAP_FROM: + case GOMP_MAP_FORCE_FROM: + case GOMP_MAP_ALWAYS_FROM: + copyfrom = true; + /* Fallthrough. */ + + case GOMP_MAP_TO_PSET: + case GOMP_MAP_POINTER: + case GOMP_MAP_DELETE: + case GOMP_MAP_RELEASE: + cur_node.host_start = (uintptr_t) hostaddrs[i]; + cur_node.host_end = cur_node.host_start + + (kind == GOMP_MAP_POINTER + ? sizeof (void *) : sizes[i]); + n = splay_tree_lookup (&acc_dev->mem_map, &cur_node); + + if (n == NULL) + continue; + + assert (n->virtual_refcount != VREFCOUNT_LINK_KEY); + + if (n->refcount == REFCOUNT_INFINITY) + { + n->refcount = 1; + n->virtual_refcount = 0; + } - gomp_mutex_unlock (&acc_dev->lock); + if (finalize) + { + n->refcount -= n->virtual_refcount; + n->virtual_refcount = 0; + } - if (n->refcount == 0) - { - if (t->refcount == minrefs) - { - /* This is the last reference, so pull the descriptor off the - chain. This prevents gomp_unmap_vars via gomp_unmap_tgt from - freeing the device memory. */ - struct target_mem_desc *tp; - for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL; - tp = t, t = t->prev) + if (n->virtual_refcount > 0) { - if (n->tgt == t) - { - if (tp) - tp->prev = t->prev; - else - acc_dev->openacc.data_environ = t->prev; - break; - } + n->refcount--; + n->virtual_refcount--; } - } + else if (n->refcount > 0) + n->refcount--; - /* Set refcount to 1 to allow gomp_unmap_vars to unmap it. */ - n->refcount = 1; - t->refcount = minrefs; - for (size_t i = 0; i < t->list_count; i++) - if (t->list[i].key == n) - { - t->list[i].copy_from = force_copyfrom ? 1 : 0; - break; - } - - /* If running synchronously, unmap immediately. */ - if (async < acc_async_noval) - gomp_unmap_vars (t, true); - else - { - goacc_aq aq = get_goacc_asyncqueue (async); - gomp_unmap_vars_async (t, true, aq); + if (copyfrom) + gomp_copy_dev2host (acc_dev, NULL, (void *) cur_node.host_start, + (void *) (n->tgt->tgt_start + n->tgt_offset + + cur_node.host_start + - n->host_start), + cur_node.host_end - cur_node.host_start); + + if (n->refcount == 0) + gomp_remove_var (acc_dev, n); + break; + + default: + gomp_mutex_unlock (&acc_dev->lock); + gomp_fatal ("gomp_acc_remove_pointer unhandled kind 0x%.2x", + kind); } } gomp_mutex_unlock (&acc_dev->lock); - - gomp_debug (0, " %s: mappings restored\n", __FUNCTION__); } diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c index 68a60de24fa..7e72d9c6b24 100644 --- a/libgomp/oacc-parallel.c +++ b/libgomp/oacc-parallel.c @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned short *kinds) if (pos + 1 >= mapnum) return 0; - unsigned char kind = kinds[pos+1] & 0xff; + unsigned char kind0 = kinds[pos] & 0xff; - if (kind == GOMP_MAP_TO_PSET) - return 3; - else if (kind == GOMP_MAP_POINTER) - return 2; + switch (kind0) + { + case GOMP_MAP_TO: + case GOMP_MAP_FORCE_TO: + case GOMP_MAP_FROM: + case GOMP_MAP_FORCE_FROM: + case GOMP_MAP_TOFROM: + case GOMP_MAP_FORCE_TOFROM: + case GOMP_MAP_ALLOC: + case GOMP_MAP_RELEASE: + { + unsigned char kind1 = kinds[pos + 1] & 0xff; + if (kind1 == GOMP_MAP_POINTER + || kind1 == GOMP_MAP_ALWAYS_POINTER) + return 2; + else if (kind1 == GOMP_MAP_TO_PSET) + return 3; + } + default: + /* empty. */; + } return 0; } @@ -745,8 +762,14 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum, } else { - gomp_acc_insert_pointer (pointer, &hostaddrs[i], - &sizes[i], &kinds[i], async); + goacc_aq aq = get_goacc_asyncqueue (async); + for (int j = 0; j < 2; j++) + gomp_map_vars_async (acc_dev, aq, + (j == 0 || pointer == 2) ? 1 : 2, + &hostaddrs[i + j], NULL, + &sizes[i + j], &kinds[i + j], true, + GOMP_MAP_VARS_OPENACC_ENTER_DATA); + /* Increment 'i' by two because OpenACC requires fortran arrays to be contiguous, so each PSET is associated with one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and @@ -791,11 +814,8 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum, } else { - bool copyfrom = (kind == GOMP_MAP_FORCE_FROM - || kind == GOMP_MAP_FROM); - gomp_acc_remove_pointer (hostaddrs[i], sizes[i], copyfrom, async, - finalize, pointer); - /* See the above comment. */ + gomp_acc_remove_pointer (acc_dev, &hostaddrs[i], &sizes[i], + &kinds[i], async, finalize, pointer); i += pointer - 1; } } diff --git a/libgomp/target.c b/libgomp/target.c index a83cb48108a..b42b4ad2448 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -536,7 +536,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, struct target_mem_desc *tgt = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum); tgt->list_count = mapnum; - tgt->refcount = pragma_kind == GOMP_MAP_VARS_ENTER_DATA ? 0 : 1; + tgt->refcount = (pragma_kind == GOMP_MAP_VARS_ENTER_DATA + || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) ? 0 : 1; tgt->device_descr = devicep; struct gomp_coalesce_buf cbuf, *cbufp = NULL; @@ -883,13 +884,14 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, kind & typemask, cbufp); else { - k->link_key = NULL; + k->u.link_key = NULL; if (n && n->refcount == REFCOUNT_LINK) { /* Replace target address of the pointer with target address of mapped object in the splay tree. */ splay_tree_remove (mem_map, n); - k->link_key = n; + k->u.link_key = n; + k->virtual_refcount = VREFCOUNT_LINK_KEY; } size_t align = (size_t) 1 << (kind >> rshift); tgt->list[i].key = k; @@ -913,7 +915,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, tgt->list[i].offset = 0; tgt->list[i].length = k->host_end - k->host_start; k->refcount = 1; - k->dynamic_refcount = 0; + k->virtual_refcount = 0; tgt->refcount++; array->left = NULL; array->right = NULL; @@ -1007,7 +1009,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, kind); } - if (k->link_key) + if (k->virtual_refcount == VREFCOUNT_LINK_KEY && k->u.link_key) { /* Set link pointer on target to the device address of the mapped object. */ @@ -1051,8 +1053,20 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep, /* If the variable from "omp target enter data" map-list was already mapped, tgt is not needed. Otherwise tgt will be freed by gomp_unmap_vars or gomp_exit_data. */ - if (pragma_kind == GOMP_MAP_VARS_ENTER_DATA && tgt->refcount == 0) - { + if ((pragma_kind == GOMP_MAP_VARS_ENTER_DATA + || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) + && tgt->refcount == 0) + { + /* If we're about to discard a target_mem_desc with no "structural" + references (tgt->refcount == 0), any splay keys linked in the tgt's + list must have their virtual refcount incremented to represent that + "lost" reference in order to implement the semantics of the OpenACC + "present increment" operation properly. */ + if (pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) + for (i = 0; i < tgt->list_count; i++) + if (tgt->list[i].key) + tgt->list[i].key->virtual_refcount++; + free (tgt); tgt = NULL; } @@ -1092,32 +1106,66 @@ gomp_unmap_tgt (struct target_mem_desc *tgt) free (tgt); } -attribute_hidden bool -gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k) +static bool +gomp_unref_tgt (void *ptr) { bool is_tgt_unmapped = false; - splay_tree_remove (&devicep->mem_map, k); - if (k->link_key) - splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->link_key); - if (k->tgt->refcount > 1) - k->tgt->refcount--; + + struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; + + if (tgt->refcount > 1) + tgt->refcount--; else { + gomp_unmap_tgt (tgt); is_tgt_unmapped = true; - gomp_unmap_tgt (k->tgt); } + return is_tgt_unmapped; } static void -gomp_unref_tgt (void *ptr) +gomp_unref_tgt_void (void *ptr) { - struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; + (void) gomp_unref_tgt (ptr); +} - if (tgt->refcount > 1) - tgt->refcount--; +static inline __attribute__((always_inline)) bool +gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k, + struct goacc_asyncqueue *aq) +{ + bool is_tgt_unmapped = false; + splay_tree_remove (&devicep->mem_map, k); + if (k->virtual_refcount == VREFCOUNT_LINK_KEY) + { + if (k->u.link_key) + splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->u.link_key); + } + if (aq) + devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void, + (void *) k->tgt); else - gomp_unmap_tgt (tgt); + is_tgt_unmapped = gomp_unref_tgt ((void *) k->tgt); + return is_tgt_unmapped; +} + +attribute_hidden bool +gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k) +{ + return gomp_remove_var_internal (devicep, k, NULL); +} + +/* Remove a variable asynchronously. This actually removes the variable + mapping immediately, but retains the linked target_mem_desc until the + asynchronous operation has completed (as it may still refer to target + memory). The device lock must be held before entry, and remains locked on + exit. */ + +attribute_hidden void +gomp_remove_var_async (struct gomp_device_descr *devicep, splay_tree_key k, + struct goacc_asyncqueue *aq) +{ + (void) gomp_remove_var_internal (devicep, k, aq); } /* Unmap variables described by TGT. If DO_COPYFROM is true, copy relevant @@ -1153,7 +1201,15 @@ gomp_unmap_vars_internal (struct target_mem_desc *tgt, bool do_copyfrom, continue; bool do_unmap = false; - if (k->refcount > 1 && k->refcount != REFCOUNT_INFINITY) + if (k->tgt == tgt + && k->virtual_refcount > 0 + && k->virtual_refcount != VREFCOUNT_LINK_KEY + && k->refcount != REFCOUNT_INFINITY) + { + k->virtual_refcount--; + k->refcount--; + } + else if (k->refcount > 1 && k->refcount != REFCOUNT_INFINITY) k->refcount--; else if (k->refcount == 1) { @@ -1173,7 +1229,7 @@ gomp_unmap_vars_internal (struct target_mem_desc *tgt, bool do_copyfrom, } if (aq) - devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt, + devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void, (void *) tgt); else gomp_unref_tgt ((void *) tgt); @@ -1310,7 +1366,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, k->tgt = tgt; k->tgt_offset = target_table[i].start; k->refcount = REFCOUNT_INFINITY; - k->link_key = NULL; + k->virtual_refcount = 0; array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -1342,7 +1398,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version, k->tgt = tgt; k->tgt_offset = target_var->start; k->refcount = target_size & link_bit ? REFCOUNT_LINK : REFCOUNT_INFINITY; - k->link_key = NULL; + k->virtual_refcount = 0; array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -1576,22 +1632,6 @@ gomp_unload_device (struct gomp_device_descr *devicep) } } -/* Free address mapping tables. MM must be locked on entry, and remains locked - on return. */ - -attribute_hidden void -gomp_free_memmap (struct splay_tree_s *mem_map) -{ - while (mem_map->root) - { - struct target_mem_desc *tgt = mem_map->root->key.tgt; - - splay_tree_remove (mem_map, &mem_map->root->key); - free (tgt->array); - free (tgt); - } -} - /* Host fallback for GOMP_target{,_ext} routines. */ static void @@ -2073,9 +2113,9 @@ gomp_exit_data (struct gomp_device_descr *devicep, size_t mapnum, if (k->refcount == 0) { splay_tree_remove (&devicep->mem_map, k); - if (k->link_key) + if (k->virtual_refcount == VREFCOUNT_LINK_KEY && k->u.link_key) splay_tree_insert (&devicep->mem_map, - (splay_tree_node) k->link_key); + (splay_tree_node) k->u.link_key); if (k->tgt->refcount > 1) k->tgt->refcount--; else @@ -2612,6 +2652,8 @@ omp_target_associate_ptr (const void *host_ptr, const void *device_ptr, k->tgt = tgt; k->tgt_offset = (uintptr_t) device_ptr + device_offset; k->refcount = REFCOUNT_INFINITY; + k->virtual_refcount = 0; + k->u.link_key = NULL; array->left = NULL; array->right = NULL; splay_tree_insert (&devicep->mem_map, array); @@ -2882,7 +2924,6 @@ gomp_target_init (void) current_device.type = current_device.get_type_func (); current_device.mem_map.root = NULL; current_device.state = GOMP_DEVICE_UNINITIALIZED; - current_device.openacc.data_environ = NULL; for (i = 0; i < new_num_devices; i++) { current_device.target_id = i; diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c index 6a52f746dcb..6bdcfe7d429 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c @@ -182,13 +182,13 @@ main (int argc, char **argv) exit (EXIT_FAILURE); } + acc_delete (&h_X[0], N * sizeof (float)); + acc_delete (&h_Y1[0], N * sizeof (float)); + free (h_X); free (h_Y1); free (h_Y2); - acc_free (d_X); - acc_free (d_Y); - context_check (pctx); s = cublasDestroy (h); diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c index 71365e8ed32..b403a5cf5cb 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c @@ -176,13 +176,13 @@ main (int argc, char **argv) exit (EXIT_FAILURE); } + acc_delete (&h_X[0], N * sizeof (float)); + acc_delete (&h_Y1[0], N * sizeof (float)); + free (h_X); free (h_Y1); free (h_Y2); - acc_free (d_X); - acc_free (d_Y); - context_check (pctx); s = cublasDestroy (h); diff --git a/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 index 83a540070e6..6bb92c12ed1 100644 --- a/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 +++ b/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 @@ -1,4 +1,5 @@ ! { dg-do run } +! { dg-additional-options "-cpp" } program test use openacc @@ -70,10 +71,14 @@ program test end do !$acc end parallel - !$acc exit data copyout (d(1:N)) async + !$acc exit data delete (c(1:N)) copyout (d(1:N)) async !$acc exit data async !$acc wait +#if !ACC_MEM_SHARED + if (acc_is_present (c) .eqv. .TRUE.) call abort +#endif + do i = 1, N if (d(i) .ne. 4.0) call abort end do