Message ID | 20230920114405.134100-1-julian@codesourcery.com
State      | New
Series     | OpenMP: Support accelerated 2D/3D memory copies for AMD GCN
Hi Julian,

On 20.09.23 13:44, Julian Brown wrote:
> This patch adds support for 2D/3D memory copies for omp_target_memcpy_rect
> and "target update", using AMD extensions to the HSA API. I've just
> committed a version of this patch to the og13 branch, but this is the
> mainline version.
>
> Support is also added for 1-dimensional strided accesses: these are
> treated as a special case of 2-dimensional transfers, where the innermost
> dimension is formed from the stride length (in bytes).
>
> This patch has (somewhat awkwardly from a review perspective) been merged
> on top of the following list of in-review series:
>
> "OpenMP/OpenACC: map clause and OMP gimplify rework":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627895.html

[That's now in mainline.]

> "OpenMP: lvalue parsing and "declare mapper" support":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629363.html

[Review in progress for the lvalue patches (C++, then C); the mapper part
should not be required for this patch.]

> "OpenMP: Array-shaping operator and strided/rectangular 'target update'
> support":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629422.html

Here, 2/5 and 3/5 are required (and 1/5 is committed). [4/5 and 5/5 make it
more useful but should not be required for this patch.]

> "OpenMP: Enable 'declare mapper' mappers for 'target update' directives":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629432.html

I think this patch is not required for that one, but it is obviously still
useful.

* * *

I think it makes sense to split this patch into two parts:

* The libgomp/plugin/plugin-gcn.c part – which is independent and would
  already be used by omp_target_memcpy_rect.

* The libgomp/target.c part, which depends on other patches and adds 1D
  strided transfer support.

I will ignore the latter change in this review and add a note to
https://gcc.gnu.org/wiki/openmpPendingPatches to ensure that it will
eventually be reviewed and added.

* * *

> Though it only depends directly on parts of that work (regarding
> strided/rectangular updates). A stand-alone version that just works
> for the OpenMP API routine omp_target_memcpy_rect could be prepared to
> apply separately, if preferable.
>
> This version has been re-tested and bootstrapped. OK?
>
> 2023-09-20  Julian Brown  <julian@codesourcery.com>
>
> libgomp/
>     * plugin/plugin-gcn.c (hsa_runtime_fn_info): Add
>     hsa_amd_memory_lock_fn, hsa_amd_memory_unlock_fn,
>     hsa_amd_memory_async_copy_rect_fn function pointers.
>     (init_hsa_runtime_functions): Add above functions, with
>     DLSYM_OPT_FN.
>     (GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d): New functions.
>     * target.c (omp_target_memcpy_rect_worker): Add 1D strided transfer
>     support.
> ---
>  libgomp/plugin/plugin-gcn.c | 359 ++++++++++++++++++++++++++++++++++++

The plugin change LGTM. Thanks.

>  libgomp/target.c            |  31 ++++

I defer the review of this part until the other required patches are in. Cf.
https://gcc.gnu.org/wiki/openmpPendingPatches

Thanks,

Tobias

> [Full patch quoted below without further comments; see the committed
> version in the follow-up.]
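For context, the user-facing routine that the new plugin entry points accelerate is omp_target_memcpy_rect. Below is a minimal, self-contained C sketch of a 2-D host-to-device copy through that API; the array shape, variable names and device-number handling are illustrative only and are not taken from the patch or its testsuite.

#include <omp.h>
#include <string.h>

int
main (void)
{
  enum { ROWS = 8, COLS = 10 };
  double host[ROWS][COLS];
  memset (host, 0, sizeof host);

  int dev = omp_get_default_device ();
  int host_dev = omp_get_initial_device ();
  double *devbuf = omp_target_alloc (sizeof host, dev);
  if (!devbuf)
    return 1;

  /* Copy a 4x5 subarray starting at [2][3] of the 8x10 host array into the
     same position within an 8x10 device array: one call, two dimensions.  */
  size_t volume[2]  = { 4, 5 };
  size_t offsets[2] = { 2, 3 };
  size_t dims[2]    = { ROWS, COLS };
  int ret = omp_target_memcpy_rect (devbuf, host, sizeof (double), 2,
                                    volume, offsets, offsets, dims, dims,
                                    dev, host_dev);

  omp_target_free (devbuf, dev);
  return ret;
}

With the plugin part of the patch applied, a call like this should be serviced by GOMP_OFFLOAD_memcpy2d when the HSA runtime provides the AMD extension functions; otherwise libgomp falls back to the generic element-by-element path.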
On Thu, 21 Dec 2023 17:05:18 +0100
Tobias Burnus <tobias@codesourcery.com> wrote:

> I think it makes sense to split this patch into two parts:
>
> * The libgomp/plugin/plugin-gcn.c part – which is independent and would
>   already be used by omp_target_memcpy_rect.

I will commit this version in a moment. I needed to add the DLSYM_OPT_FN bit
from one of Andrew Stubbs's patches elsewhere.

Re-tested with offloading to AMD GCN (...with a couple of patches applied
locally to get working test results, as plain mainline as of a few days ago
wasn't working too well for GCN offloading).

Thanks for the review!

Julian
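The DLSYM_OPT_FN definition itself comes from Andrew Stubbs's patch and is not part of the diff below. Conceptually it is the tolerant variant of DLSYM_FN: the symbol is looked up, but a missing symbol does not fail plugin initialisation, so the corresponding hsa_fns pointer stays NULL and GOMP_OFFLOAD_memcpy2d/3d detect that and return -1. A rough sketch of such a macro (illustrative only; the dlopen handle name and exact wording of the real macro are assumptions, not the committed code):

/* Illustrative sketch: resolve an optional HSA symbol.  Unlike DLSYM_FN, do
   not fail init_hsa_runtime_functions when the runtime lacks the symbol --
   the function pointer simply stays NULL, and callers such as
   GOMP_OFFLOAD_memcpy2d check for that and fall back by returning -1.  */
#define DLSYM_OPT_FN(function) \
  hsa_fns.function##_fn = dlsym (handle, #function);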
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index ef22d48da79..95c0a57e792 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -196,6 +196,16 @@ struct hsa_runtime_fn_info
   hsa_status_t (*hsa_code_object_deserialize_fn)
     (void *serialized_code_object, size_t serialized_code_object_size,
      const char *options, hsa_code_object_t *code_object);
+  hsa_status_t (*hsa_amd_memory_lock_fn)
+    (void *host_ptr, size_t size, hsa_agent_t *agents, int num_agent,
+     void **agent_ptr);
+  hsa_status_t (*hsa_amd_memory_unlock_fn) (void *host_ptr);
+  hsa_status_t (*hsa_amd_memory_async_copy_rect_fn)
+    (const hsa_pitched_ptr_t *dst, const hsa_dim3_t *dst_offset,
+     const hsa_pitched_ptr_t *src, const hsa_dim3_t *src_offset,
+     const hsa_dim3_t *range, hsa_agent_t copy_agent,
+     hsa_amd_copy_direction_t dir, uint32_t num_dep_signals,
+     const hsa_signal_t *dep_signals, hsa_signal_t completion_signal);
 };
 
 /* Structure describing the run-time and grid properties of an HSA kernel
@@ -1398,6 +1408,9 @@ init_hsa_runtime_functions (void)
   DLSYM_FN (hsa_signal_load_acquire)
   DLSYM_FN (hsa_queue_destroy)
   DLSYM_FN (hsa_code_object_deserialize)
+  DLSYM_OPT_FN (hsa_amd_memory_lock)
+  DLSYM_OPT_FN (hsa_amd_memory_unlock)
+  DLSYM_OPT_FN (hsa_amd_memory_async_copy_rect)
   return true;
 #undef DLSYM_FN
 }
@@ -3790,6 +3803,352 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n)
   return true;
 }
 
+/* Here <quantity>_size refers to <quantity> multiplied by size -- i.e.
+   measured in bytes.  So we have:
+
+   dim1_size: number of bytes to copy on innermost dimension ("row")
+   dim0_len: number of rows to copy
+   dst: base pointer for destination of copy
+   dst_offset1_size: innermost row offset (for dest), in bytes
+   dst_offset0_len: offset, number of rows (for dest)
+   dst_dim1_size: whole-array dest row length, in bytes (pitch)
+   src: base pointer for source of copy
+   src_offset1_size: innermost row offset (for source), in bytes
+   src_offset0_len: offset, number of rows (for source)
+   src_dim1_size: whole-array source row length, in bytes (pitch)
+*/
+
+int
+GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
+                       size_t dim0_len, void *dst, size_t dst_offset1_size,
+                       size_t dst_offset0_len, size_t dst_dim1_size,
+                       const void *src, size_t src_offset1_size,
+                       size_t src_offset0_len, size_t src_dim1_size)
+{
+  if (!hsa_fns.hsa_amd_memory_lock_fn
+      || !hsa_fns.hsa_amd_memory_unlock_fn
+      || !hsa_fns.hsa_amd_memory_async_copy_rect_fn)
+    return -1;
+
+  /* GCN hardware requires 4-byte alignment for base addresses & pitches.  Bail
+     out quietly if we have anything oddly-aligned rather than letting the
+     driver raise an error.  */
+  if ((((uintptr_t) dst) & 3) != 0 || (((uintptr_t) src) & 3) != 0)
+    return -1;
+
+  if ((dst_dim1_size & 3) != 0 || (src_dim1_size & 3) != 0)
+    return -1;
+
+  /* Only handle host to device or device to host transfers here.  */
+  if ((dst_ord == -1 && src_ord == -1)
+      || (dst_ord != -1 && src_ord != -1))
+    return -1;
+
+  hsa_amd_copy_direction_t dir
+    = (src_ord == -1) ? hsaHostToDevice : hsaDeviceToHost;
+  hsa_agent_t copy_agent;
+
+  /* We need to pin (lock) host memory before we start the transfer.  Try to
+     lock the minimum size necessary, i.e. using partial first/last rows of the
+     whole array.  Something like this:
+
+         rows -->
+         ..............
+     c | ..#######+++++ <- first row apart from {src,dst}_offset1_size
+     o | ++#######+++++ <- whole row
+     l | ++#######+++++ <-   "
+     s v ++#######..... <- last row apart from trailing remainder
+         ..............
+
+     We could split very large transfers into several rectangular copies, but
+     that is unimplemented for now.  */
+
+  size_t bounded_size_host, first_elem_offset_host;
+  void *host_ptr;
+  if (dir == hsaHostToDevice)
+    {
+      bounded_size_host = src_dim1_size * (dim0_len - 1) + dim1_size;
+      first_elem_offset_host = src_offset0_len * src_dim1_size
+                               + src_offset1_size;
+      host_ptr = (void *) src;
+      struct agent_info *agent = get_agent_info (dst_ord);
+      copy_agent = agent->id;
+    }
+  else
+    {
+      bounded_size_host = dst_dim1_size * (dim0_len - 1) + dim1_size;
+      first_elem_offset_host = dst_offset0_len * dst_dim1_size
+                               + dst_offset1_size;
+      host_ptr = dst;
+      struct agent_info *agent = get_agent_info (src_ord);
+      copy_agent = agent->id;
+    }
+
+  void *agent_ptr;
+
+  hsa_status_t status
+    = hsa_fns.hsa_amd_memory_lock_fn (host_ptr + first_elem_offset_host,
+                                      bounded_size_host, NULL, 0, &agent_ptr);
+  /* We can't lock the host memory: don't give up though, we might still be
+     able to use the slow path in our caller.  So, don't make this an
+     error.  */
+  if (status != HSA_STATUS_SUCCESS)
+    return -1;
+
+  hsa_pitched_ptr_t dstpp, srcpp;
+  hsa_dim3_t dst_offsets, src_offsets, ranges;
+
+  int retval = 1;
+
+  hsa_signal_t completion_signal;
+  status = hsa_fns.hsa_signal_create_fn (1, 0, NULL, &completion_signal);
+  if (status != HSA_STATUS_SUCCESS)
+    {
+      retval = -1;
+      goto unlock;
+    }
+
+  if (dir == hsaHostToDevice)
+    {
+      srcpp.base = agent_ptr - first_elem_offset_host;
+      dstpp.base = dst;
+    }
+  else
+    {
+      srcpp.base = (void *) src;
+      dstpp.base = agent_ptr - first_elem_offset_host;
+    }
+
+  srcpp.pitch = src_dim1_size;
+  srcpp.slice = 0;
+
+  src_offsets.x = src_offset1_size;
+  src_offsets.y = src_offset0_len;
+  src_offsets.z = 0;
+
+  dstpp.pitch = dst_dim1_size;
+  dstpp.slice = 0;
+
+  dst_offsets.x = dst_offset1_size;
+  dst_offsets.y = dst_offset0_len;
+  dst_offsets.z = 0;
+
+  ranges.x = dim1_size;
+  ranges.y = dim0_len;
+  ranges.z = 1;
+
+  status
+    = hsa_fns.hsa_amd_memory_async_copy_rect_fn (&dstpp, &dst_offsets, &srcpp,
+                                                 &src_offsets, &ranges,
+                                                 copy_agent, dir, 0, NULL,
+                                                 completion_signal);
+  /* If the rectangular copy fails, we might still be able to use the slow
+     path.  We need to unlock the host memory though, so don't return
+     immediately.  */
+  if (status != HSA_STATUS_SUCCESS)
+    retval = -1;
+  else
+    hsa_fns.hsa_signal_wait_acquire_fn (completion_signal,
+                                        HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
+                                        HSA_WAIT_STATE_ACTIVE);
+
+  hsa_fns.hsa_signal_destroy_fn (completion_signal);
+
+unlock:
+  status = hsa_fns.hsa_amd_memory_unlock_fn (host_ptr + first_elem_offset_host);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not unlock host memory", status);
+
+  return retval;
+}
+
+/* As above, <quantity>_size refers to <quantity> multiplied by size -- i.e.
+   measured in bytes.  So we have:
+
+   dim2_size: number of bytes to copy on innermost dimension ("row")
+   dim1_len: number of rows per slice to copy
+   dim0_len: number of slices to copy
+   dst: base pointer for destination of copy
+   dst_offset2_size: innermost row offset (for dest), in bytes
+   dst_offset1_len: offset, number of rows (for dest)
+   dst_offset0_len: offset, number of slices (for dest)
+   dst_dim2_size: whole-array dest row length, in bytes (pitch)
+   dst_dim1_len: whole-array number of rows in slice (for dest)
+   src: base pointer for source of copy
+   src_offset2_size: innermost row offset (for source), in bytes
+   src_offset1_len: offset, number of rows (for source)
+   src_offset0_len: offset, number of slices (for source)
+   src_dim2_size: whole-array source row length, in bytes (pitch)
+   src_dim1_len: whole-array number of rows in slice (for source)
+*/
+
+int
+GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
+                       size_t dim1_len, size_t dim0_len, void *dst,
+                       size_t dst_offset2_size, size_t dst_offset1_len,
+                       size_t dst_offset0_len, size_t dst_dim2_size,
+                       size_t dst_dim1_len, const void *src,
+                       size_t src_offset2_size, size_t src_offset1_len,
+                       size_t src_offset0_len, size_t src_dim2_size,
+                       size_t src_dim1_len)
+{
+  if (!hsa_fns.hsa_amd_memory_lock_fn
+      || !hsa_fns.hsa_amd_memory_unlock_fn
+      || !hsa_fns.hsa_amd_memory_async_copy_rect_fn)
+    return -1;
+
+  /* GCN hardware requires 4-byte alignment for base addresses & pitches.  Bail
+     out quietly if we have anything oddly-aligned rather than letting the
+     driver raise an error.  */
+  if ((((uintptr_t) dst) & 3) != 0 || (((uintptr_t) src) & 3) != 0)
+    return -1;
+
+  if ((dst_dim2_size & 3) != 0 || (src_dim2_size & 3) != 0)
+    return -1;
+
+  /* Only handle host to device or device to host transfers here.  */
+  if ((dst_ord == -1 && src_ord == -1)
+      || (dst_ord != -1 && src_ord != -1))
+    return -1;
+
+  hsa_amd_copy_direction_t dir
+    = (src_ord == -1) ? hsaHostToDevice : hsaDeviceToHost;
+  hsa_agent_t copy_agent;
+
+  /* We need to pin (lock) host memory before we start the transfer.  Try to
+     lock the minimum size necessary, i.e. using partial first/last slices of
+     the whole 3D array.  Something like this:
+
+            slice 0:       slice 1:       slice 2:
+           __________     __________     __________
+        ^ /+++++++++/ :  /+++++++++/ :  /         /
+   column /+++##++++/| | /+++##++++/| | /+++##    /    # = subarray
+      / /   ##++++/  | |/+++##++++/  | |/+++##++++/    + = area to pin
+       /_________/   :  /_________/   :  /_________/
+                row --->
+
+     We could split very large transfers into several rectangular copies, but
+     that is unimplemented for now.  */
+
+  size_t bounded_size_host, first_elem_offset_host;
+  void *host_ptr;
+  if (dir == hsaHostToDevice)
+    {
+      size_t slice_bytes = src_dim2_size * src_dim1_len;
+      bounded_size_host = slice_bytes * (dim0_len - 1)
+                          + src_dim2_size * (dim1_len - 1)
+                          + dim2_size;
+      first_elem_offset_host = src_offset0_len * slice_bytes
+                               + src_offset1_len * src_dim2_size
+                               + src_offset2_size;
+      host_ptr = (void *) src;
+      struct agent_info *agent = get_agent_info (dst_ord);
+      copy_agent = agent->id;
+    }
+  else
+    {
+      size_t slice_bytes = dst_dim2_size * dst_dim1_len;
+      bounded_size_host = slice_bytes * (dim0_len - 1)
+                          + dst_dim2_size * (dim1_len - 1)
+                          + dim2_size;
+      first_elem_offset_host = dst_offset0_len * slice_bytes
+                               + dst_offset1_len * dst_dim2_size
+                               + dst_offset2_size;
+      host_ptr = dst;
+      struct agent_info *agent = get_agent_info (src_ord);
+      copy_agent = agent->id;
+    }
+
+  void *agent_ptr;
+
+  hsa_status_t status
+    = hsa_fns.hsa_amd_memory_lock_fn (host_ptr + first_elem_offset_host,
+                                      bounded_size_host, NULL, 0, &agent_ptr);
+  /* We can't lock the host memory: don't give up though, we might still be
+     able to use the slow path in our caller (maybe even with iterated memcpy2d
+     calls).  So, don't make this an error.  */
+  if (status != HSA_STATUS_SUCCESS)
+    return -1;
+
+  hsa_pitched_ptr_t dstpp, srcpp;
+  hsa_dim3_t dst_offsets, src_offsets, ranges;
+
+  int retval = 1;
+
+  hsa_signal_t completion_signal;
+  status = hsa_fns.hsa_signal_create_fn (1, 0, NULL, &completion_signal);
+  if (status != HSA_STATUS_SUCCESS)
+    {
+      retval = -1;
+      goto unlock;
+    }
+
+  if (dir == hsaHostToDevice)
+    {
+      srcpp.base = agent_ptr - first_elem_offset_host;
+      dstpp.base = dst;
+    }
+  else
+    {
+      srcpp.base = (void *) src;
+      dstpp.base = agent_ptr - first_elem_offset_host;
+    }
+
+  /* Pitch is measured in bytes.  */
+  srcpp.pitch = src_dim2_size;
+  /* Slice is also measured in bytes (i.e. total per-slice).  */
+  srcpp.slice = src_dim2_size * src_dim1_len;
+
+  src_offsets.x = src_offset2_size;
+  src_offsets.y = src_offset1_len;
+  src_offsets.z = src_offset0_len;
+
+  /* As above.  */
+  dstpp.pitch = dst_dim2_size;
+  dstpp.slice = dst_dim2_size * dst_dim1_len;
+
+  dst_offsets.x = dst_offset2_size;
+  dst_offsets.y = dst_offset1_len;
+  dst_offsets.z = dst_offset0_len;
+
+  ranges.x = dim2_size;
+  ranges.y = dim1_len;
+  ranges.z = dim0_len;
+
+  status
+    = hsa_fns.hsa_amd_memory_async_copy_rect_fn (&dstpp, &dst_offsets, &srcpp,
+                                                 &src_offsets, &ranges,
+                                                 copy_agent, dir, 0, NULL,
+                                                 completion_signal);
+  /* If the rectangular copy fails, we might still be able to use the slow
+     path.  We need to unlock the host memory though, so don't return
+     immediately.  */
+  if (status != HSA_STATUS_SUCCESS)
+    retval = -1;
+  else
+    {
+      hsa_signal_value_t sv
+        = hsa_fns.hsa_signal_wait_acquire_fn (completion_signal,
+                                              HSA_SIGNAL_CONDITION_LT, 1,
+                                              UINT64_MAX,
+                                              HSA_WAIT_STATE_ACTIVE);
+      if (sv < 0)
+        {
+          GCN_WARNING ("async copy rect failure");
+          retval = -1;
+        }
+    }
+
+  hsa_fns.hsa_signal_destroy_fn (completion_signal);
+
+unlock:
+  status = hsa_fns.hsa_amd_memory_unlock_fn (host_ptr + first_elem_offset_host);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not unlock host memory", status);
+
+  return retval;
+}
+
 /* }}} */
 /* {{{ OpenMP Plugin API */
 
diff --git a/libgomp/target.c b/libgomp/target.c
index 9b9b08db8be..8f0795837ae 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4726,6 +4726,37 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
       if (__builtin_mul_overflow (span, strides[0], &stride))
         return EINVAL;
 
+      if (((src_devicep && src_devicep->memcpy2d_func)
+           || (dst_devicep && dst_devicep->memcpy2d_func))
+          && (stride % element_size) == 0)
+        {
+          /* Try using memcpy2d for a 1-dimensional strided access.  Here we
+             treat the transfer as a 2-dimensional array, where the inner
+             dimension is calculated to be (stride in bytes) / element_size.
+             Indices/offsets are adjusted so the source/destination pointers
+             point to the first element to be transferred, to make the sums
+             easier.  (There are some configurations of 2D strided accesses
+             that memcpy3d could handle similarly, but those are probably rare
+             and are unimplemented for now.)  */
+
+          /* If stride is element size, this is a contiguous transfer and
+             should have been handled above.  */
+          assert (stride > element_size);
+
+          int dst_id = dst_devicep ? dst_devicep->target_id : -1;
+          int src_id = src_devicep ? src_devicep->target_id : -1;
+          void *subarray_src = (char *) src + src_off;
+          void *subarray_dst = (char *) dst + dst_off;
+
+          struct gomp_device_descr *devp = dst_devicep ? dst_devicep
+                                                       : src_devicep;
+          ret = devp->memcpy2d_func (dst_id, src_id, element_size, volume[0],
+                                     subarray_dst, 0, 0, stride, subarray_src,
+                                     0, 0, stride);
+          if (ret != -1)
+            return ret ? 0 : EINVAL;
+        }
+
       for (i = 0, ret = 1; i < volume[0] && ret; i++)
         {
           if (src_devicep == NULL)
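To make the target.c mapping concrete: a 1-D strided transfer of volume[0] elements whose byte stride is a multiple of the element size is expressed as a 2-D copy whose "rows" are single elements and whose pitch is the stride, exactly as in the devp->memcpy2d_func (..., element_size, volume[0], subarray_dst, 0, 0, stride, subarray_src, 0, 0, stride) call above. The following host-only model of that call is illustrative (plain memcpy stands in for the device transfer; none of these names appear in the patch):

#include <assert.h>
#include <string.h>

/* Host-side model of the 2-D copy made for a 1-D strided access: copy
   dim0_len "rows" of dim1_size bytes each, with consecutive rows pitch
   bytes apart in both buffers.  */
static void
model_memcpy2d (size_t dim1_size, size_t dim0_len,
                char *dst, size_t dst_pitch,
                const char *src, size_t src_pitch)
{
  for (size_t row = 0; row < dim0_len; row++)
    memcpy (dst + row * dst_pitch, src + row * src_pitch, dim1_size);
}

int
main (void)
{
  /* Eight doubles taken every third element: element_size = 8 bytes,
     stride = 24 bytes, volume[0] = 8 -- i.e. a 2-D copy of 8 rows of
     8 bytes with a 24-byte pitch.  */
  double src[24], dst[24] = { 0 };
  for (int i = 0; i < 24; i++)
    src[i] = i;

  model_memcpy2d (sizeof (double), 8, (char *) dst, 3 * sizeof (double),
                  (const char *) src, 3 * sizeof (double));

  /* Only every third element was copied; the rest stays zero.  */
  for (int i = 0; i < 24; i++)
    assert (dst[i] == (i % 3 == 0 ? src[i] : 0.0));
  return 0;
}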