From patchwork Thu Aug 29 13:32:29 2024
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1978449
Date: Thu, 29 Aug 2024 15:32:29 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/4] lower SLP load permutation to interleaving
Message-Id: <20240829133302.78BC43860772@sourceware.org>

The following emulates classical interleaving for SLP load permutes
that we are unlikely to handle natively.  This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[4*i + 0] = b[4*i + 0] * 3;
      a[4*i + 1] = b[4*i + 1] + 3;
      a[4*i + 2] = (b[4*i + 2] * 3 + 3);
      a[4*i + 3] = b[4*i + 3] * 3;
    }
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated with
V4SImode vectors and a VF of four as the permutes would need three
vectors.
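For reference, here is a minimal scalar sketch in plain C of the
two-level even/odd decomposition the lowering emulates for the
stride-4 group above (illustration only, not GCC internals; the
array contents are made up).  Each level halves the effective group
size and only ever combines two vectors, which is what keeps the
lowered form within two-input permutes:

#include <stdio.h>

int main (void)
{
  int b[16], even[8], odd[8], lane[4][4];
  for (int i = 0; i < 16; ++i)
    b[i] = i;                        /* b[4*i + l] is lane l of iteration i */
  /* Level 1: even/odd extract; even holds lanes 0/2, odd holds lanes 1/3.  */
  for (int i = 0; i < 8; ++i)
    {
      even[i] = b[2*i];
      odd[i] = b[2*i + 1];
    }
  /* Level 2: even/odd extract again separates the single lanes.  */
  for (int i = 0; i < 4; ++i)
    {
      lane[0][i] = even[2*i];
      lane[2][i] = even[2*i + 1];
      lane[1][i] = odd[2*i];
      lane[3][i] = odd[2*i + 1];
    }
  for (int l = 0; l < 4; ++l)        /* prints 0 4 8 12, 1 5 9 13, ...  */
    {
      for (int i = 0; i < 4; ++i)
        printf ("%d ", lane[l][i]);
      printf ("\n");
    }
  return 0;
}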
The patch introduces a lowering phase after SLP discovery and SLP
pattern recognition but before permute optimization that analyzes
all loads from the same dataref group and creates an interleaving
scheme starting from an unpermuted load.

What can be handled is a power-of-two group size and a group size
of three.  Doing the interleaving with a load-lanes-like instruction
is left for a followup.

For a group size of three this is done by using the non-interleaving
fallback code which then creates at VF == 4 from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 } (a plain C sketch of this follows the
ChangeLog below).  This turns out to be more effective than the
scheme implemented for non-SLP on SSE, only slightly worse on AVX512
and somewhat worse on AVX2.  It seems to me that this would extend
to other non-power-of-two group sizes though (but the patch does not
implement that).  Optimal schemes are likely difficult to lay out in
a VF-agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again this is
difficult to do in a VF-agnostic way (but at least currently the
vector type is fixed).

I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for a store group size of three, so
the testcase uses a store group size of four.

The patch has a fallback for when there are multi-lane groups and
the resulting permutes do not fit interleaving.  Code generation is
not optimal when this triggers and might be worse than doing
single-lane group interleaving.

The patch handles gaps by representing them with NULL entries in
SLP_TREE_SCALAR_STMTS for the unpermuted load node.  The SLP
discovery changes could be elided if we manually built the load node
instead.

SLP load nodes covering enough lanes to not need intermediate
permutes are retained as having a load permutation and do not use
the single SLP load node for each dataref group.  That's something
we might want to change, making the load permutation something
purely local to SLP discovery (but then SLP discovery could do part
of the lowering).

The patch misses CSEing intermediately generated permutes and
registering them with the bst_map, which is possibly required for
SLP pattern detection in some cases - this re-spin of the patch
moves the lowering after SLP pattern detection.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
	(vect_build_slp_tree_2): Likewise.  Release load permutation
	when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's
	no actual permutation in that case.
	(vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.

	* gcc.dg/vect/slp-11a.c: Expect SLP.
	* gcc.dg/vect/slp-12a.c: Likewise.
	* gcc.dg/vect/slp-51.c: New testcase.
	* gcc.dg/vect/slp-52.c: New testcase.
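To make the group-size-three fallback above concrete, here is a
scalar sketch in plain C (illustration only, with made-up values;
not the patch's code) of extracting lane 2 - the c's - at VF == 4.
The used lane set { 2 } is padded to half the group size, giving the
per-instance selection { 2, 2 }, and the final permute then picks
every other element:

#include <stdio.h>

int main (void)
{
  /* Memory holds { a0, b0, c0, a1, b1, c1, a2, b2, c2, a3, b3, c3 }.  */
  int mem[12];
  for (int i = 0; i < 4; ++i)
    for (int l = 0; l < 3; ++l)
      mem[3*i + l] = 100 * l + i;    /* c_i is encoded as 200 + i */
  /* Intermediate node: selection { 2, 2 } per group instance, i.e. the
     two vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }.  */
  int interm[8];
  for (int i = 0; i < 4; ++i)
    {
      interm[2*i] = mem[3*i + 2];
      interm[2*i + 1] = mem[3*i + 2];
    }
  /* Final permute: every other element gives { c0, c1, c2, c3 }.  */
  for (int i = 0; i < 4; ++i)
    printf ("%d ", interm[2*i]);     /* prints 200 201 202 203 */
  printf ("\n");
  return 0;
}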
--- gcc/testsuite/gcc.dg/vect/slp-11a.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-12a.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-51.c | 17 ++ gcc/testsuite/gcc.dg/vect/slp-52.c | 14 ++ gcc/tree-vect-slp.cc | 347 +++++++++++++++++++++++++++- 5 files changed, 378 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/slp-51.c create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c diff --git a/gcc/testsuite/gcc.dg/vect/slp-11a.c b/gcc/testsuite/gcc.dg/vect/slp-11a.c index fcb7cf6c7a2..2efa1796757 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-11a.c +++ b/gcc/testsuite/gcc.dg/vect/slp-11a.c @@ -72,4 +72,4 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c b/gcc/testsuite/gcc.dg/vect/slp-12a.c index 2f98dc9da0b..fedf27b69d2 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-12a.c +++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c @@ -80,5 +80,5 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-51.c b/gcc/testsuite/gcc.dg/vect/slp-51.c new file mode 100644 index 00000000000..91ae763be30 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/slp-51.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ + +void foo (int * __restrict x, int *y) +{ + x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__); + y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__); + for (int i = 0; i < 1024; ++i) + { + x[4*i+0] = y[4*i+0]; + x[4*i+1] = y[4*i+2] * 2; + x[4*i+2] = y[4*i+0] + 3; + x[4*i+3] = y[4*i+2] * 2 - 5; + } +} + +/* Check we can handle SLP with gaps and an interleaving scheme. 
*/ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c b/gcc/testsuite/gcc.dg/vect/slp-52.c new file mode 100644 index 00000000000..ba49f0046e2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/slp-52.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ + +void foo (int * __restrict x, int *y) +{ + for (int i = 0; i < 1024; ++i) + { + x[4*i+0] = y[3*i+0]; + x[4*i+1] = y[3*i+1] * 2; + x[4*i+2] = y[3*i+2] + 3; + x[4*i+3] = y[3*i+2] * 2 - 5; + } +} + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */ diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index fe4981177bf..9cc7c59f94f 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -1080,10 +1080,15 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap, stmt_vec_info stmt_info; FOR_EACH_VEC_ELT (stmts, i, stmt_info) { - gimple *stmt = stmt_info->stmt; swap[i] = 0; matches[i] = false; + if (!stmt_info) + { + matches[i] = true; + continue; + } + gimple *stmt = stmt_info->stmt; if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "Build SLP for %G", stmt); @@ -1978,10 +1983,16 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node, stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]); bool any_permute = false; + bool any_null = false; FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info) { int load_place; - if (STMT_VINFO_GROUPED_ACCESS (stmt_info)) + if (! load_info) + { + load_place = j; + any_null = true; + } + else if (STMT_VINFO_GROUPED_ACCESS (stmt_info)) load_place = vect_get_place_in_interleaving_chain (load_info, first_stmt_info); else @@ -1990,6 +2001,11 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node, any_permute |= load_place != j; load_permutation.quick_push (load_place); } + if (any_null) + { + gcc_assert (!any_permute); + load_permutation.release (); + } if (gcall *stmt = dyn_cast (stmt_info->stmt)) { @@ -4077,6 +4093,316 @@ vect_analyze_slp_instance (vec_info *vinfo, return res; } +/* qsort comparator ordering SLP load nodes. */ + +static int +vllp_cmp (const void *a_, const void *b_) +{ + const slp_tree a = *(const slp_tree *)a_; + const slp_tree b = *(const slp_tree *)b_; + stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0]; + stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0]; + if (STMT_VINFO_GROUPED_ACCESS (a0) + && STMT_VINFO_GROUPED_ACCESS (b0) + && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0)) + { + /* Same group, order after lanes used. */ + if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b)) + return 1; + else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b)) + return -1; + else + { + /* Try to order loads using the same lanes together, breaking + the tie with the lane number that first differs. */ + if (!SLP_TREE_LOAD_PERMUTATION (a).exists () + && !SLP_TREE_LOAD_PERMUTATION (b).exists ()) + return 0; + else if (SLP_TREE_LOAD_PERMUTATION (a).exists () + && !SLP_TREE_LOAD_PERMUTATION (b).exists ()) + return 1; + else if (!SLP_TREE_LOAD_PERMUTATION (a).exists () + && SLP_TREE_LOAD_PERMUTATION (b).exists ()) + return -1; + else + { + for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i) + if (SLP_TREE_LOAD_PERMUTATION (a)[i] + != SLP_TREE_LOAD_PERMUTATION (b)[i]) + { + /* In-order lane first, that's what the above case for + no permutation does. 
*/ + if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i) + return -1; + else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i) + return 1; + else if (SLP_TREE_LOAD_PERMUTATION (a)[i] + < SLP_TREE_LOAD_PERMUTATION (b)[i]) + return -1; + else + return 1; + } + return 0; + } + } + } + else /* Different groups or non-groups. */ + { + /* Order groups as their first element to keep them together. */ + if (STMT_VINFO_GROUPED_ACCESS (a0)) + a0 = DR_GROUP_FIRST_ELEMENT (a0); + if (STMT_VINFO_GROUPED_ACCESS (b0)) + b0 = DR_GROUP_FIRST_ELEMENT (b0); + if (a0 == b0) + return 0; + /* Tie using UID. */ + else if (gimple_uid (STMT_VINFO_STMT (a0)) + < gimple_uid (STMT_VINFO_STMT (b0))) + return -1; + else + { + gcc_assert (gimple_uid (STMT_VINFO_STMT (a0)) + != gimple_uid (STMT_VINFO_STMT (b0))); + return 1; + } + } +} + +/* Process the set of LOADS that are all from the same dataref group. */ + +static void +vect_lower_load_permutations (loop_vec_info loop_vinfo, + scalar_stmts_to_slp_tree_map_t *bst_map, + const array_slice &loads) +{ + /* We at this point want to lower without a fixed VF or vector + size in mind which means we cannot actually compute whether we + need three or more vectors for a load permutation yet. So always + lower. */ + stmt_vec_info first + = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]); + + /* Only a power-of-two number of lanes matches interleaving with N levels. + ??? An even number of lanes could be reduced to 1<= (group_lanes + 1) / 2) + continue; + + /* First build (and possibly re-use) a load node for the + unpermuted group. Gaps in the middle and on the end are + represented with NULL stmts. */ + vec stmts; + stmts.create (group_lanes); + for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s)) + { + if (s != first) + for (unsigned i = 1; i < DR_GROUP_GAP (s); ++i) + stmts.quick_push (NULL); + stmts.quick_push (s); + } + for (unsigned i = 0; i < DR_GROUP_GAP (first); ++i) + stmts.quick_push (NULL); + poly_uint64 max_nunits = 1; + bool *matches = XALLOCAVEC (bool, group_lanes); + unsigned limit = 1; + unsigned tree_size = 0; + slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts, + group_lanes, + &max_nunits, matches, &limit, + &tree_size, bst_map); + + /* Build the permute to get the original load permutation order. */ + lane_permutation_t final_perm; + final_perm.create (SLP_TREE_LANES (load)); + for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i) + final_perm.quick_push + (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i])); + + while (1) + { + unsigned group_lanes = SLP_TREE_LANES (l0); + if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2) + break; + + /* Try to lower by reducing the group to half its size using an + interleaving scheme. For this try to compute whether all + elements needed for this load are in even or odd elements of + an even/odd decomposition with N consecutive elements. + Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition + with N == 2. */ + /* ??? Only an even number of lanes can be handed this way, but the + fallback below could work for any number. We have to make sure + to round up in that case. */ + gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3); + unsigned even = 0, odd = 0; + if ((group_lanes & 1) == 0) + { + even = (1 << ceil_log2 (group_lanes)) - 1; + odd = even; + for (auto l : final_perm) + { + even &= ~l.second; + odd &= l.second; + } + } + + /* Now build an even or odd extraction from the unpermuted load. 
*/ + lane_permutation_t perm; + perm.create ((group_lanes + 1) / 2); + unsigned level; + if (even + && ((level = 1 << ctz_hwi (even)), true) + && group_lanes % (2 * level) == 0) + { + /* { 0, 1, ... 4, 5 ..., } */ + unsigned level = 1 << ctz_hwi (even); + for (unsigned i = 0; i < group_lanes / 2 / level; ++i) + for (unsigned j = 0; j < level; ++j) + perm.quick_push (std::make_pair (0, 2 * i * level + j)); + } + else if (odd) + { + /* { ..., 2, 3, ... 6, 7 } */ + unsigned level = 1 << ctz_hwi (odd); + gcc_assert (group_lanes % (2 * level) == 0); + for (unsigned i = 0; i < group_lanes / 2 / level; ++i) + for (unsigned j = 0; j < level; ++j) + perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j)); + } + else + { + /* As fallback extract all used lanes and fill to half the + group size by repeating the last element. + ??? This is quite a bad strathegy for re-use - we could + brute force our way to find more optimal filling lanes to + maximize re-use when looking at all loads from the group. */ + auto_bitmap l; + for (auto p : final_perm) + bitmap_set_bit (l, p.second); + unsigned i = 0; + bitmap_iterator bi; + EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi) + perm.quick_push (std::make_pair (0, i)); + while (perm.length () < (group_lanes + 1) / 2) + perm.quick_push (perm.last ()); + } + + /* Update final_perm with the intermediate permute. */ + for (unsigned i = 0; i < final_perm.length (); ++i) + { + unsigned l = final_perm[i].second; + unsigned j; + for (j = 0; j < perm.length (); ++j) + if (perm[j].second == l) + { + final_perm[i].second = j; + break; + } + gcc_assert (j < perm.length ()); + } + + /* And create scalar stmts. */ + vec perm_stmts; + perm_stmts.create (perm.length ()); + for (unsigned i = 0; i < perm.length (); ++i) + perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]); + + slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR); + SLP_TREE_CHILDREN (p).quick_push (l0); + SLP_TREE_LANE_PERMUTATION (p) = perm; + SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load); + SLP_TREE_LANES (p) = perm.length (); + SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load); + /* ??? As we have scalar stmts for this intermediate permute we + could CSE it via bst_map but we do not want to pick up + another SLP node with a load permutation. We instead should + have a "local" CSE map here. */ + SLP_TREE_SCALAR_STMTS (p) = perm_stmts; + + /* We now have a node for (group_lanes + 1) / 2 lanes. */ + l0 = p; + } + + /* And finally from the ordered reduction node create the + permute to shuffle the lanes into the original load-permutation + order. We replace the original load node with this. */ + SLP_TREE_CODE (load) = VEC_PERM_EXPR; + SLP_TREE_LOAD_PERMUTATION (load).release (); + SLP_TREE_LANE_PERMUTATION (load) = final_perm; + SLP_TREE_CHILDREN (load).create (1); + SLP_TREE_CHILDREN (load).quick_push (l0); + } +} + +/* Transform SLP loads in the SLP graph created by SLP discovery to + group loads from the same group and lower load permutations that + are unlikely to be supported into a series of permutes. + In the degenerate case of having only single-lane SLP instances + this should result in a series of permute nodes emulating an + interleaving scheme. */ + +static void +vect_lower_load_permutations (loop_vec_info loop_vinfo, + scalar_stmts_to_slp_tree_map_t *bst_map) +{ + /* Gather and sort loads across all instances. 
*/
+  hash_set visited;
+  auto_vec loads;
+  for (auto inst : loop_vinfo->slp_instances)
+    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
+  if (loads.is_empty ())
+    return;
+  loads.qsort (vllp_cmp);
+
+  /* Now process each dataref group separately.  */
+  unsigned firsti = 0;
+  for (unsigned i = 1; i < loads.length (); ++i)
+    {
+      slp_tree first = loads[firsti];
+      slp_tree next = loads[i];
+      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
+      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
+      if (STMT_VINFO_GROUPED_ACCESS (a0)
+	  && STMT_VINFO_GROUPED_ACCESS (b0)
+	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+	continue;
+      /* Just one SLP load of a possible group, leave those alone.  */
+      if (i == firsti + 1)
+	{
+	  firsti = i;
+	  continue;
+	}
+      /* Now we have multiple SLP loads of the same group from
+	 firsti to i - 1.  */
+      vect_lower_load_permutations (loop_vinfo, bst_map,
+				    make_array_slice (&loads[firsti],
+						      i - firsti));
+      firsti = i;
+    }
+  if (firsti < loads.length () - 1)
+    vect_lower_load_permutations (loop_vinfo, bst_map,
+				  make_array_slice (&loads[firsti],
+						    loads.length () - firsti));
+}
+
 /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
    trees of packed scalar stmts if SLP is possible.  */

@@ -4241,6 +4567,23 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	}
     }

+  /* When we end up with load permutations that we cannot possibly handle,
+     like those requiring three vector inputs, lower them using interleaving
+     like schemes.  */
+  if (loop_vec_info loop_vinfo = dyn_cast (vinfo))
+    {
+      vect_lower_load_permutations (loop_vinfo, bst_map);
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "SLP graph after lowering permutations:\n");
+	  hash_set visited;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
+	    vect_print_slp_graph (MSG_NOTE, vect_location,
+				  SLP_INSTANCE_TREE (instance), visited);
+	}
+    }
+
   release_scalar_stmts_to_slp_tree_map (bst_map);

   if (pattern_found && dump_enabled_p ())

From patchwork Thu Aug 29 13:32:34 2024
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1978452
Date: Thu, 29 Aug 2024 15:32:34 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 2/4] load and store-lanes with SLP
Message-Id: <20240829133308.D09F23861828@sourceware.org>

The following is a prototype for how to represent load/store-lanes
within SLP.  I've for now settled on having a single load node with
multiple permute nodes acting as selection, one for each loaded lane,
and a single store node fed from all stored lanes.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you have the following SLP graph, where I explain how things are set
up and code-generated:

t.c:23:21: note:   SLP graph after lowering permutations:
t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note:   op template: *_6 = _7;
t.c:23:21: note:        stmt 0 *_6 = _7;
t.c:23:21: note:        stmt 1 *_12 = _13;
t.c:23:21: note:        children 0x50dc488 0x50dc6e8

This is the store node, it's marked with ldst_lanes = true during SLP
discovery.  This node code-generates

  vect_array.65[0] = vect__7.61_29;
  vect_array.65[1] = vect__13.62_28;
  MEM [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...

t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        lane permutation { 0[0] }
t.c:23:21: note:        children 0x50dc948
t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _11 = *_10;
t.c:23:21: note:        lane permutation { 0[1] }
t.c:23:21: note:        children 0x50dc948

These are the selection nodes, marked with ldst_lanes = true.  They
code-generate nothing.

t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note:   op template: _5 = *_4;
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        stmt 1 _11 = *_10;
t.c:23:21: note:        load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking into account the lane
permute in the selection nodes).  It code-generates

  vect_array.58 = .LOAD_LANES (MEM [(int *)vectp_b.56_33]);
  vect__5.59_31 = vect_array.58[0];
  vect__5.60_30 = vect_array.58[1];

This scheme allows to leave code generation in vectorizable_load/store
mostly as-is.
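As a reading aid, the following plain C sketch (illustration only;
.LOAD_LANES/.STORE_LANES are internal functions, the arrays here are
stand-ins for the vector defs) spells out what the generated code
above computes for the two-lane example: an ld2-style de-interleave
into the vect_array members, the per-lane operations, and an
st2-style re-interleave:

#include <stdio.h>

int main (void)
{
  int b[8] = {0, 1, 2, 3, 4, 5, 6, 7}, a[8];
  int vect_array[2][4];                 /* the "vect_array" of the dump */
  for (int i = 0; i < 4; ++i)           /* .LOAD_LANES: ld2 semantics */
    {
      vect_array[0][i] = b[2*i];        /* selected by lane permutation 0[0] */
      vect_array[1][i] = b[2*i + 1];    /* selected by lane permutation 0[1] */
    }
  for (int i = 0; i < 4; ++i)
    {
      vect_array[0][i] += 7;            /* a[2*i]   = b[2*i] + 7   */
      vect_array[1][i] *= 3;            /* a[2*i+1] = b[2*i+1] * 3 */
    }
  for (int i = 0; i < 4; ++i)           /* .STORE_LANES: st2 semantics */
    {
      a[2*i] = vect_array[0][i];
      a[2*i + 1] = vect_array[1][i];
    }
  for (int i = 0; i < 8; ++i)
    printf ("%d ", a[i]);               /* prints 7 3 9 9 11 15 13 21 */
  printf ("\n");
  return 0;
}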
While this should support both load-lanes and (masked) store-lanes,
the decision to use either is made at SLP discovery time and cannot
be reversed without altering the SLP tree - as-is the SLP tree is not
usable for non-store-lanes on the store side; the load side is OK
representation-wise but will very likely fail permute handling as the
lowering to deal with the two-input-vector restriction isn't done -
but of course since the permute node is marked as to be ignored that
doesn't work out.  So I've put restrictions in place that fail
vectorization if a load/store-lanes SLP tree is later classified
differently by get_load_store_type.

I'll note that for example gcc.target/aarch64/sve/mask_struct_store_3.c
will not get SLP store-lanes used because the full store SLPs just
fine, though we then fail to handle the "splat" load permutation

t2.c:5:21: note:   node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
t2.c:5:21: note:   op template: _6 = *_5;
t2.c:5:21: note:        stmt 0 _6 = *_5;
t2.c:5:21: note:        stmt 1 _6 = *_5;
t2.c:5:21: note:        stmt 2 _6 = *_5;
t2.c:5:21: note:        stmt 3 _6 = *_5;
t2.c:5:21: note:        load permutation { 0 0 0 0 }

The load permute lowering code currently doesn't consider it worth
lowering single loads from a group (or, as in this case, loads that
are not grouped).  The expectation is that the target can handle this
by two interleaves of the vector with itself (a scalar sketch follows
the ChangeLog below).

So what we see here is that while the explicit SLP representation is
helpful in some cases, in cases like this it would require changing
the representation when we make decisions on how to vectorize.  My
idea is that this all will change a lot when we re-do SLP discovery
(for loops) and when we get rid of non-SLP, as I think vectorizable_*
should be allowed to alter the SLP graph during analysis.

Testing on RISC-V and aarch64 reveals that the late fallback in
vect_analyze_loop_2 - scrapping SLP and starting over without it to
get load/store-lanes - is still effective, and that trying to perform
the equivalent on the SLP representation by enforcing single-lane
discovery and SLP load/store-lanes interacts with SLP pattern
recognition for complex ops.  I have a test followup running into
this; I am now considering instead moving this to after pattern
recognition, which is also why I moved the load permute lowering
there.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
	(vect_build_slp_instance): For stores, iff the target prefers
	store-lanes, discover single-lane sub-groups; do not perform
	interleaving lowering but mark the node with ldst_lanes.
	(vect_lower_load_permutations): When the target supports load
	lanes and the loads all fit the pattern, split out a single
	level of permutes only and mark the load and permute nodes
	with ldst_lanes.
	(vectorizable_slp_permutation_1): Handle the load-lane permute
	forwarding of vector defs.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Support SLP code generation for
	store-lanes.
	(vectorizable_load): Support SLP code generation for
	load-lanes.
	* gcc.dg/vect/slp-55.c: New testcase.
	* gcc.dg/vect/slp-56.c: Likewise.
	* gcc.dg/vect/slp-11c.c: Adjust.
	* gcc.dg/vect/slp-53.c: Likewise.
	* gcc.dg/vect/slp-cond-1.c: Likewise.
	* gcc.dg/vect/vect-complex-5.c: Likewise.
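As an aside on the "two interleaves with itself" expectation
mentioned above the ChangeLog, here is a plain C stand-in
(illustration only; zip_low models a zip1-style interleave-low,
the values are made up) showing how the { 0 0 0 0 } splat
permutation can be realized:

#include <stdio.h>

static void
zip_low (const int *a, const int *b, int *out)  /* 4-lane interleave-low */
{
  out[0] = a[0]; out[1] = b[0]; out[2] = a[1]; out[3] = b[1];
}

int main (void)
{
  int v[4] = {42, 1, 2, 3}, t[4], splat[4];
  zip_low (v, v, t);        /* { 42, 42, 1, 1 } */
  zip_low (t, t, splat);    /* { 42, 42, 42, 42 } */
  for (int i = 0; i < 4; ++i)
    printf ("%d ", splat[i]);
  printf ("\n");
  return 0;
}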
--- gcc/testsuite/gcc.dg/vect/slp-11c.c | 3 +- gcc/testsuite/gcc.dg/vect/slp-53.c | 3 +- gcc/testsuite/gcc.dg/vect/slp-55.c | 37 ++++++ gcc/testsuite/gcc.dg/vect/slp-56.c | 51 ++++++++ gcc/testsuite/gcc.dg/vect/slp-cond-1.c | 3 +- gcc/testsuite/gcc.dg/vect/vect-complex-5.c | 3 +- gcc/tree-vect-slp.cc | 133 +++++++++++++++++++-- gcc/tree-vect-stmts.cc | 127 +++++++++++++++----- gcc/tree-vectorizer.h | 4 + 9 files changed, 319 insertions(+), 45 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/slp-55.c create mode 100644 gcc/testsuite/gcc.dg/vect/slp-56.c diff --git a/gcc/testsuite/gcc.dg/vect/slp-11c.c b/gcc/testsuite/gcc.dg/vect/slp-11c.c index 2e70fca39ba..25d7f2ce383 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-11c.c +++ b/gcc/testsuite/gcc.dg/vect/slp-11c.c @@ -45,5 +45,4 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { vect_load_lanes } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target { vect_load_lanes } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-53.c b/gcc/testsuite/gcc.dg/vect/slp-53.c index d8cd5f85b3c..50b3e9d3cee 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-53.c +++ b/gcc/testsuite/gcc.dg/vect/slp-53.c @@ -12,4 +12,5 @@ void foo (int * __restrict x, int *y) } } -/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail vect_load_lanes } } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */ +/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target { vect_load_lanes } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-55.c b/gcc/testsuite/gcc.dg/vect/slp-55.c new file mode 100644 index 00000000000..0bf65ef6dc4 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/slp-55.c @@ -0,0 +1,37 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_int_mult } */ +/* { dg-additional-options "-fdump-tree-optimized" } */ + +void foo (int * __restrict a, int *b, int *c) +{ + for (int i = 0; i < 1024; ++i) + { + a[2*i] = b[i] + 7; + a[2*i+1] = c[i] * 3; + } +} + +int bar (int *b) +{ + int res = 0; + for (int i = 0; i < 1024; ++i) + { + res += b[2*i] + 7; + res += b[2*i+1] * 3; + } + return res; +} + +void baz (int * __restrict a, int *b) +{ + for (int i = 0; i < 1024; ++i) + { + a[2*i] = b[2*i] + 7; + a[2*i+1] = b[2*i+1] * 3; + } +} + +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */ +/* { dg-final { scan-tree-dump-times "LOAD_LANES" 2 "optimized" { target vect_load_lanes } } } */ +/* { dg-final { scan-tree-dump-times "STORE_LANES" 2 "optimized" { target vect_load_lanes } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-56.c b/gcc/testsuite/gcc.dg/vect/slp-56.c new file mode 100644 index 00000000000..0b985eae55e --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/slp-56.c @@ -0,0 +1,51 @@ +#include "tree-vect.h" + +/* This is a load-lane / masked-store-lane test that more reliably + triggers SLP than SVEs mask_srtuct_store_*.c */ + +void __attribute__ ((noipa)) +test4 (int *__restrict dest, 
int *__restrict src, + int *__restrict cond, int bias, int n) +{ + for (int i = 0; i < n; ++i) + { + int value0 = src[i * 4] + bias; + int value1 = src[i * 4 + 1] * bias; + int value2 = src[i * 4 + 2] + bias; + int value3 = src[i * 4 + 3] * bias; + if (cond[i]) + { + dest[i * 4] = value0; + dest[i * 4 + 1] = value1; + dest[i * 4 + 2] = value2; + dest[i * 4 + 3] = value3; + } + } +} + +int dest[16*4]; +int src[16*4]; +int cond[16]; +const int dest_chk[16*4] = {0, 0, 0, 0, 9, 25, 11, 35, 0, 0, 0, 0, 17, 65, 19, + 75, 0, 0, 0, 0, 25, 105, 27, 115, 0, 0, 0, 0, 33, 145, 35, 155, 0, 0, 0, + 0, 41, 185, 43, 195, 0, 0, 0, 0, 49, 225, 51, 235, 0, 0, 0, 0, 57, 265, 59, + 275, 0, 0, 0, 0, 65, 305, 67, 315}; + +int main() +{ + check_vect (); +#pragma GCC novector + for (int i = 0; i < 16; ++i) + cond[i] = i & 1; +#pragma GCC novector + for (int i = 0; i < 16 * 4; ++i) + src[i] = i; + test4 (dest, src, cond, 5, 16); +#pragma GCC novector + for (int i = 0; i < 16 * 4; ++i) + if (dest[i] != dest_chk[i]) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target { vect_variable_length && vect_load_lanes } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c index c76ea5d17ef..16ab0cc7605 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c +++ b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c @@ -125,5 +125,4 @@ main () return 0; } -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { ! vect_load_lanes } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c index ac562dc475c..0d850720d63 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c +++ b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c @@ -40,5 +40,4 @@ main (void) return 0; } -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! vect_load_lanes } xfail { ! vect_hw_misalign } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_hw_misalign } } } } */ diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 9cc7c59f94f..bfecbc87f29 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -120,6 +120,7 @@ _slp_tree::_slp_tree () SLP_TREE_SIMD_CLONE_INFO (this) = vNULL; SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def; SLP_TREE_CODE (this) = ERROR_MARK; + this->ldst_lanes = false; SLP_TREE_VECTYPE (this) = NULL_TREE; SLP_TREE_REPRESENTATIVE (this) = NULL; SLP_TREE_REF_COUNT (this) = 1; @@ -3902,10 +3903,28 @@ vect_build_slp_instance (vec_info *vinfo, /* For loop vectorization split the RHS into arbitrary pieces of size >= 1. */ else if (is_a (vinfo) - && (i > 0 && i < group_size) - && !vect_slp_prefer_store_lanes_p (vinfo, - stmt_info, group_size, i)) - { + && (i > 0 && i < group_size)) + { + /* There are targets that cannot do even/odd interleaving schemes + so they absolutely need to use load/store-lanes. For now + force single-lane SLP for them - they would be happy with + uniform power-of-two lanes (but depending on element size), + but even if we can use 'i' as indicator we would need to + backtrack when later lanes fail to discover with the same + granularity. 
We cannot turn any of strided or scatter store + into store-lanes. */ + /* ??? If this is not in sync with what get_load_store_type + later decides the SLP representation is not good for other + store vectorization methods. */ + bool want_store_lanes + = (! STMT_VINFO_GATHER_SCATTER_P (stmt_info) + && ! STMT_VINFO_STRIDED_P (stmt_info) + && compare_step_with_zero (vinfo, stmt_info) > 0 + && vect_slp_prefer_store_lanes_p (vinfo, stmt_info, + group_size, 1)); + if (want_store_lanes) + i = 1; + if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "Splitting SLP group at stmt %u\n", i); @@ -3939,7 +3958,10 @@ vect_build_slp_instance (vec_info *vinfo, (max_nunits, end - start)); rhs_nodes.safe_push (node); start = end; - end = group_size; + if (want_store_lanes) + end = start + 1; + else + end = group_size; } else { @@ -3973,7 +3995,31 @@ vect_build_slp_instance (vec_info *vinfo, } /* Now we assume we can build the root SLP node from all stores. */ - node = vect_build_slp_store_interleaving (rhs_nodes, scalar_stmts); + if (want_store_lanes) + { + /* For store-lanes feed the store node with all RHS nodes + in order. */ + node = vect_create_new_slp_node (scalar_stmts, + SLP_TREE_CHILDREN + (rhs_nodes[0]).length ()); + SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (rhs_nodes[0]); + node->ldst_lanes = true; + SLP_TREE_CHILDREN (node) + .reserve_exact (SLP_TREE_CHILDREN (rhs_nodes[0]).length () + + rhs_nodes.length () - 1); + /* First store value and possibly mask. */ + SLP_TREE_CHILDREN (node) + .splice (SLP_TREE_CHILDREN (rhs_nodes[0])); + /* Rest of the store values. All mask nodes are the same, + this should be guaranteed by dataref group discovery. */ + for (unsigned j = 1; j < rhs_nodes.length (); ++j) + SLP_TREE_CHILDREN (node) + .quick_push (SLP_TREE_CHILDREN (rhs_nodes[j])[0]); + for (slp_tree child : SLP_TREE_CHILDREN (node)) + child->refcnt++; + } + else + node = vect_build_slp_store_interleaving (rhs_nodes, scalar_stmts); while (!rhs_nodes.is_empty ()) vect_free_slp_tree (rhs_nodes.pop ()); @@ -4189,6 +4235,44 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo, if (exact_log2 (group_lanes) == -1 && group_lanes != 3) return; + /* Verify if all load permutations can be implemented with a suitably + large element load-lanes operation. */ + unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]); + if (STMT_VINFO_STRIDED_P (first) + || compare_step_with_zero (loop_vinfo, first) <= 0 + || exact_log2 (ld_lanes_lanes) == -1 + /* ??? For now only support the single-lane case as there is + missing support on the store-lane side and code generation + isn't up to the task yet. */ + || ld_lanes_lanes != 1 + || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]), + group_lanes / ld_lanes_lanes, + false) == IFN_LAST) + ld_lanes_lanes = 0; + else + /* Verify the loads access the same number of lanes aligned to + ld_lanes_lanes. */ + for (slp_tree load : loads) + { + if (SLP_TREE_LANES (load) != ld_lanes_lanes) + { + ld_lanes_lanes = 0; + break; + } + unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0]; + if (first % ld_lanes_lanes != 0) + { + ld_lanes_lanes = 0; + break; + } + for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i) + if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i) + { + ld_lanes_lanes = 0; + break; + } + } + for (slp_tree load : loads) { /* Leave masked or gather loads alone for now. */ @@ -4203,7 +4287,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo, with a non-1:1 load permutation around instead of canonicalizing those into a load and a permute node. 
Removing this early check would do such canonicalization. */ - if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2) + if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2 + && ld_lanes_lanes == 0) continue; /* First build (and possibly re-use) a load node for the @@ -4236,10 +4321,20 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo, final_perm.quick_push (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i])); + if (ld_lanes_lanes != 0) + { + /* ??? If this is not in sync with what get_load_store_type + later decides the SLP representation is not good for other + store vectorization methods. */ + l0->ldst_lanes = true; + load->ldst_lanes = true; + } + while (1) { unsigned group_lanes = SLP_TREE_LANES (l0); - if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2) + if (ld_lanes_lanes != 0 + || SLP_TREE_LANES (load) >= (group_lanes + 1) / 2) break; /* Try to lower by reducing the group to half its size using an @@ -9874,6 +9969,28 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi, gcc_assert (perm.length () == SLP_TREE_LANES (node)); + /* Load-lanes permute. This permute only acts as a forwarder to + select the correct vector def of the load-lanes load which + has the permuted vectors in its vector defs like + { v0, w0, r0, v1, w1, r1 ... } for a ld3. */ + if (node->ldst_lanes) + { + gcc_assert (children.length () == 1); + if (!gsi) + /* This is a trivial op always supported. */ + return 1; + slp_tree child = children[0]; + unsigned vec_idx = (SLP_TREE_LANE_PERMUTATION (node)[0].second + / SLP_TREE_LANES (node)); + unsigned vec_num = SLP_TREE_LANES (child) / SLP_TREE_LANES (node); + for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i) + { + tree def = SLP_TREE_VEC_DEFS (child)[i * vec_num + vec_idx]; + node->push_vec_def (def); + } + return 1; + } + /* REPEATING_P is true if every output vector is guaranteed to use the same permute vector. We can handle that case for both variable-length and constant-length vectors, but we only handle other cases for diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 9eb73a59933..73ffe00f158 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -1508,7 +1508,8 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype, unsigned int nvectors; if (slp_node) - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + /* ??? Incorrect for multi-lane lanes. */ + nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size; else nvectors = vect_get_num_copies (loop_vinfo, vectype); @@ -1794,7 +1795,7 @@ vect_use_strided_gather_scatters_p (stmt_vec_info stmt_info, elements with a known constant step. Return -1 if that step is negative, 0 if it is zero, and 1 if it is greater than zero. */ -static int +int compare_step_with_zero (vec_info *vinfo, stmt_vec_info stmt_info) { dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info); @@ -2069,6 +2070,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, is irrelevant for them. */ *alignment_support_scheme = dr_unaligned_supported; } + /* Try using LOAD/STORE_LANES. */ + else if (slp_node->ldst_lanes + && (*lanes_ifn + = (vls_type == VLS_LOAD + ? 
		   vect_load_lanes_supported (vectype, group_size, masked_p)
+		 : vect_store_lanes_supported (vectype, group_size,
+					       masked_p))) != IFN_LAST)
+	*memory_access_type = VMAT_LOAD_STORE_LANES;
       else
 	*memory_access_type = VMAT_CONTIGUOUS;

@@ -8199,6 +8208,16 @@ vectorizable_store (vec_info *vinfo,
			      &lanes_ifn))
     return false;

+  if (slp_node
+      && slp_node->ldst_lanes
+      && memory_access_type != VMAT_LOAD_STORE_LANES)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "discovered store-lane but cannot use it.\n");
+      return false;
+    }
+
   if (mask)
     {
       if (memory_access_type == VMAT_CONTIGUOUS)
@@ -8715,7 +8734,7 @@ vectorizable_store (vec_info *vinfo,
   else
     {
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-	aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
+	aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
       else
	aggr_type = vectype;
       bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
@@ -8772,11 +8791,24 @@ vectorizable_store (vec_info *vinfo,

   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
+      if (costing_p && slp_node)
+	/* Update all incoming store operand nodes, the general handling
+	   above only handles the mask and the first store operand node.  */
+	for (slp_tree child : SLP_TREE_CHILDREN (slp_node))
+	  if (child != mask_node
+	      && !vect_maybe_update_slp_op_vectype (child, vectype))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "incompatible vector types for invariants\n");
+	      return false;
+	    }
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
	  the total number of them once instead of cost each one by one.  */
       unsigned int n_adjacent_stores = 0;
+      if (slp)
+	ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
       for (j = 0; j < ncopies; j++)
	{
	  gimple *new_stmt;
@@ -8794,7 +8826,7 @@ vectorizable_store (vec_info *vinfo,
		  op = vect_get_store_rhs (next_stmt_info);
		  if (costing_p)
		    update_prologue_cost (&prologue_cost, op);
-		  else
+		  else if (!slp)
		    {
		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
						     ncopies, op,
@@ -8809,15 +8841,15 @@ vectorizable_store (vec_info *vinfo,
	    {
	      if (mask)
		{
-		  vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-						 mask, &vec_masks,
-						 mask_vectype);
+		  if (slp_node)
+		    vect_get_slp_defs (mask_node, &vec_masks);
+		  else
+		    vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+						   mask, &vec_masks,
+						   mask_vectype);
		  vec_mask = vec_masks[0];
		}

-	      /* We should have catched mismatched types earlier.  */
-	      gcc_assert (
-		useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
	      dataref_ptr
		= vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
					    NULL, offset, &dummy,
@@ -8829,10 +8861,16 @@ vectorizable_store (vec_info *vinfo,
	  gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
	  /* DR_CHAIN is then used as an input to
	     vect_permute_store_chain().  */
-	  for (i = 0; i < group_size; i++)
+	  if (!slp)
	    {
-	      vec_oprnd = (*gvec_oprnds[i])[j];
-	      dr_chain[i] = vec_oprnd;
+	      /* We should have caught mismatched types earlier.  */
+	      gcc_assert (
+		useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
+	      for (i = 0; i < group_size; i++)
+		{
+		  vec_oprnd = (*gvec_oprnds[i])[j];
+		  dr_chain[i] = vec_oprnd;
+		}
	    }
	  if (mask)
	    vec_mask = vec_masks[j];
@@ -8842,12 +8880,12 @@ vectorizable_store (vec_info *vinfo,

	  if (costing_p)
	    {
-	      n_adjacent_stores += vec_num;
+	      n_adjacent_stores += group_size;
	      continue;
	    }

	  /* Get an array into which we can store the individual vectors.  */
-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  tree vec_array = create_vector_array (vectype, group_size);

	  /* Invalidate the current contents of VEC_ARRAY.  This should
	     become an RTL clobber too, which prevents the vector registers
	     from being upward-exposed.  */
	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);

	  /* Store the individual vectors into the array.  */
-	  for (i = 0; i < vec_num; i++)
+	  for (i = 0; i < group_size; i++)
	    {
-	      vec_oprnd = dr_chain[i];
+	      if (slp)
+		{
+		  slp_tree child;
+		  if (i == 0 || !mask_node)
+		    child = SLP_TREE_CHILDREN (slp_node)[i];
+		  else
+		    child = SLP_TREE_CHILDREN (slp_node)[i + 1];
+		  vec_oprnd = SLP_TREE_VEC_DEFS (child)[j];
+		}
+	      else
+		vec_oprnd = dr_chain[i];
	      write_vector_array (vinfo, stmt_info, gsi, vec_oprnd,
				  vec_array, i);
	    }
@@ -8927,9 +8975,10 @@ vectorizable_store (vec_info *vinfo,
	  /* Record that VEC_ARRAY is now dead.  */
	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-	  if (j == 0)
+	  if (j == 0 && !slp)
	    *vec_stmt = new_stmt;
-	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	  if (!slp)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
	}

       if (costing_p)
@@ -10033,6 +10082,16 @@ vectorizable_load (vec_info *vinfo,
			      &lanes_ifn))
     return false;

+  if (slp_node
+      && slp_node->ldst_lanes
+      && memory_access_type != VMAT_LOAD_STORE_LANES)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "discovered load-lane but cannot use it.\n");
+      return false;
+    }
+
   if (mask)
     {
       if (memory_access_type == VMAT_CONTIGUOUS)
@@ -10751,7 +10810,7 @@ vectorizable_load (vec_info *vinfo,
   else
     {
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-	aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
+	aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
       else
	aggr_type = vectype;
       bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
@@ -10775,12 +10834,13 @@ vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
		   || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);

       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
	  the total number of them once instead of cost each one by one.  */
       unsigned int n_adjacent_loads = 0;
+      if (slp_node)
+	ncopies = slp_node->vec_stmts_size / group_size;
       for (j = 0; j < ncopies; j++)
	{
	  if (costing_p)
@@ -10831,7 +10891,7 @@ vectorizable_load (vec_info *vinfo,
	  if (mask)
	    vec_mask = vec_masks[j];

-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  tree vec_array = create_vector_array (vectype, group_size);

	  tree final_mask = NULL_TREE;
	  tree final_len = NULL_TREE;
@@ -10894,24 +10954,31 @@ vectorizable_load (vec_info *vinfo,
	  gimple_call_set_nothrow (call, true);
	  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);

-	  dr_chain.create (vec_num);
+	  if (!slp)
+	    dr_chain.create (group_size);

	  /* Extract each vector into an SSA_NAME.  */
-	  for (i = 0; i < vec_num; i++)
+	  for (unsigned i = 0; i < group_size; i++)
	    {
	      new_temp = read_vector_array (vinfo, stmt_info, gsi,
					    scalar_dest, vec_array, i);
-	      dr_chain.quick_push (new_temp);
+	      if (slp)
+		slp_node->push_vec_def (new_temp);
+	      else
+		dr_chain.quick_push (new_temp);
	    }

-	  /* Record the mapping between SSA_NAMEs and statements.  */
-	  vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
+	  if (!slp)
+	    /* Record the mapping between SSA_NAMEs and statements.  */
+	    vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);

	  /* Record that VEC_ARRAY is now dead.  */
	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);

-	  dr_chain.release ();
+	  if (!slp)
+	    dr_chain.release ();

-	  *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+	  if (!slp_node)
+	    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
	}

       if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index df6c8ada2f7..699ae9e33ba 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@ struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;

   int vertex;

@@ -2313,6 +2316,7 @@ extern bool supportable_indirect_convert_operation (code_helper, tree, tree,
						    vec<std::pair<tree, tree_code> > *,
						    tree = NULL_TREE);
+extern int compare_step_with_zero (vec_info *, stmt_vec_info);

 extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
				   enum vect_cost_for_stmt, stmt_vec_info,
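
As an illustration of the lanes semantics the above relies on (an assumed
stand-alone sketch with invented names and a fixed group size of four, not
vectorizer code): the vec_array filled by write_vector_array is handed to a
store-lanes internal function whose memory effect roughly corresponds to

  /* Toy model of a store-lanes operation (e.g. AArch64 st4): lane i of
     vector k in the group is interleaved to dst[i * GROUP_SIZE + k].
     GROUP_SIZE and NUNITS are fixed here purely for illustration.  */
  #define GROUP_SIZE 4
  #define NUNITS 4

  static void
  store_lanes_model (int dst[GROUP_SIZE * NUNITS],
		     const int vecs[GROUP_SIZE][NUNITS])
  {
    for (int i = 0; i < NUNITS; ++i)	    /* lane within each vector */
      for (int k = 0; k < GROUP_SIZE; ++k)  /* vector within the group */
	dst[i * GROUP_SIZE + k] = vecs[k][i];
  }

A load-lanes operation is the inverse de-interleave: GROUP_SIZE * NUNITS
consecutive elements come back as GROUP_SIZE vectors, which is what the
read_vector_array loop above extracts into SSA names.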
From patchwork Thu Aug 29 13:32:47 2024
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1978453
Date: Thu, 29 Aug 2024 15:32:47 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com
Subject: [PATCH 3/4] Do not cancel SLP because of load/store-lanes from loop analysis
Message-Id: <20240829133316.AB1093860C3E@sourceware.org>

The following removes, from the main loop vectorization analysis code,
the logic that cancels SLP when load/store-lanes can be used.

	* tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP
	when store-lanes can be used.
---
 gcc/tree-vect-loop.cc | 76 -------------------------------------------
 1 file changed, 76 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6456220cdc9..8290776ec2c 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2957,82 +2957,6 @@ start_over:
				     "unsupported SLP instances\n");
	  goto again;
	}
-
-      /* Check whether any load in ALL SLP instances is possibly permuted.  */
-      slp_tree load_node, slp_root;
-      unsigned i, x;
-      slp_instance instance;
-      bool can_use_lanes = true;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), x, instance)
-	{
-	  slp_root = SLP_INSTANCE_TREE (instance);
-	  int group_size = SLP_TREE_LANES (slp_root);
-	  tree vectype = SLP_TREE_VECTYPE (slp_root);
-	  bool loads_permuted = false;
-	  FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
-	    {
-	      if (!SLP_TREE_LOAD_PERMUTATION (load_node).exists ())
-		continue;
-	      unsigned j;
-	      stmt_vec_info load_info;
-	      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (load_node), j, load_info)
-		if (SLP_TREE_LOAD_PERMUTATION (load_node)[j] != j)
-		  {
-		    loads_permuted = true;
-		    break;
-		  }
-	    }
-
-	  /* If the loads and stores can be handled with load/store-lane
-	     instructions record it and move on to the next instance.  */
-	  if (loads_permuted
-	      && SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
-	      && vect_store_lanes_supported (vectype, group_size, false)
-		   != IFN_LAST)
-	    {
-	      FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
-		if (STMT_VINFO_GROUPED_ACCESS
-		      (SLP_TREE_REPRESENTATIVE (load_node)))
-		  {
-		    stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
-		      (SLP_TREE_REPRESENTATIVE (load_node));
-		    /* Use SLP for strided accesses (or if we can't
-		       load-lanes).  */
-		    if (STMT_VINFO_STRIDED_P (stmt_vinfo)
-			|| vect_load_lanes_supported
-			     (STMT_VINFO_VECTYPE (stmt_vinfo),
-			      DR_GROUP_SIZE (stmt_vinfo), false) == IFN_LAST)
-		      break;
-		  }
-
-	      can_use_lanes
-		= can_use_lanes && i == SLP_INSTANCE_LOADS (instance).length ();
-
-	      if (can_use_lanes && dump_enabled_p ())
-		dump_printf_loc (MSG_NOTE, vect_location,
-				 "SLP instance %p can use load/store-lanes\n",
-				 (void *) instance);
-	    }
-	  else
-	    {
-	      can_use_lanes = false;
-	      break;
-	    }
-	}
-
-      /* If all SLP instances can use load/store-lanes abort SLP and try again
-	 with SLP disabled.  */
-      if (can_use_lanes)
-	{
-	  ok = opt_result::failure_at (vect_location,
-				       "Built SLP cancelled: can use "
-				       "load/store-lanes\n");
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "Built SLP cancelled: all SLP instances support "
-			     "load/store-lanes\n");
-	  goto again;
-	}
     }

   /* Dissolve SLP-only groups.  */
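
For context, an assumed example (not taken from the patch) of the situation
the removed check tested for: a store group whose SLP load node carries a
non-identity load permutation.

  /* The two store lanes read the three-element load group of b out of
     order, so the load node's permutation is not the identity; with
     load-lanes (e.g. AArch64 ld3) each of b's lanes arrives in its own
     vector and no cross-vector permute is needed.  Illustrative only.  */
  void
  bar (int *__restrict a, int *b)
  {
    for (int i = 0; i < 16; ++i)
      {
	a[2 * i + 0] = b[3 * i + 2];
	a[2 * i + 1] = b[3 * i + 0];
      }
  }

With the check above deleted, such a loop no longer cancels SLP for the
whole loop; the next patch instead re-discovers only the affected SLP
instance with single lanes.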
From patchwork Thu Aug 29 13:32:58 2024
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1978454
Date: Thu, 29 Aug 2024 15:32:58 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com
Subject: [PATCH 4/4] RISC-V: Re-implement SLP single-lane rediscovery for SLP
Message-Id: <20240829133320.DAE4C3861009@sourceware.org>

This re-implements the scrapping of SLP when load/store-lanes can be
used: instead of cancelling SLP for the whole loop, the affected SLP
instance is re-discovered with forced single-lane splits so that the
SLP load/store-lanes scheme can be used.  This is now done after SLP
discovery and SLP pattern recognition are complete, so as not to
disturb the latter, and the decision is made per SLP instance rather
than globally for the whole loop.
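
To make the decision concrete before the ChangeLog and diff below, here is
a condensed stand-alone model of the predicate being added; the toy types
are invented for illustration, while the real code walks SLP nodes with
FOR_EACH_VEC_ELT and the STMT_VINFO_* accessors shown in the diff.

  #include <stdbool.h>

  /* Toy model (not GCC code): single-lane re-discovery is forced only
     when some load permutation is not the identity, the store group
     supports store-lanes, and every grouped load is neither strided
     nor lacking load-lanes support.  */
  struct toy_load
  {
    const unsigned *perm;	/* load permutation, NULL if none */
    unsigned nlanes;
    bool grouped, strided, load_lanes_ok;
  };

  static bool
  want_single_lane_rediscovery (const struct toy_load *loads,
				unsigned nloads, bool store_lanes_ok)
  {
    bool permuted = false;
    for (unsigned i = 0; i < nloads && !permuted; ++i)
      if (loads[i].perm)
	for (unsigned j = 0; j < loads[i].nlanes; ++j)
	  if (loads[i].perm[j] != j)
	    {
	      permuted = true;
	      break;
	    }
    if (!permuted || !store_lanes_ok)
      return false;
    for (unsigned i = 0; i < nloads; ++i)
      if (loads[i].grouped && (loads[i].strided || !loads[i].load_lanes_ok))
	return false;
    return true;
  }

The real implementation additionally requires a positive access step (via
compare_step_with_zero) and, when the predicate holds, frees the instance
and re-runs vect_analyze_slp_instance with force_single_lane set.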
	* tree-vect-slp.cc (vect_analyze_slp): After SLP pattern recog
	is finished see if there are any SLP instances that would benefit
	from using load/store-lanes and re-discover those with forced
	single lanes.
---
 gcc/tree-vect-slp.cc | 117 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 108 insertions(+), 9 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index bfecbc87f29..852937cdc34 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3481,7 +3481,8 @@ static bool
 vect_analyze_slp_instance (vec_info *vinfo,
			    scalar_stmts_to_slp_tree_map_t *bst_map,
			    stmt_vec_info stmt_info, slp_instance_kind kind,
-			    unsigned max_tree_size, unsigned *limit);
+			    unsigned max_tree_size, unsigned *limit,
+			    bool force_single_lane = false);

 /* Build an interleaving scheme for the store sources RHS_NODES from
    SCALAR_STMTS.  */
@@ -3676,7 +3677,8 @@ vect_build_slp_instance (vec_info *vinfo,
			  unsigned max_tree_size, unsigned *limit,
			  scalar_stmts_to_slp_tree_map_t *bst_map,
			  /* ??? We need stmt_info for group splitting.  */
-			  stmt_vec_info stmt_info_)
+			  stmt_vec_info stmt_info_,
+			  bool force_single_lane = false)
 {
   /* If there's no budget left bail out early.  */
   if (*limit == 0)
@@ -3705,9 +3707,17 @@ vect_build_slp_instance (vec_info *vinfo,
   poly_uint64 max_nunits = 1;
   unsigned tree_size = 0;
   unsigned i;
-  slp_tree node = vect_build_slp_tree (vinfo, scalar_stmts, group_size,
-				       &max_nunits, matches, limit,
-				       &tree_size, bst_map);
+
+  slp_tree node = NULL;
+  if (force_single_lane)
+    {
+      matches[0] = true;
+      matches[1] = false;
+    }
+  else
+    node = vect_build_slp_tree (vinfo, scalar_stmts, group_size,
+				&max_nunits, matches, limit,
+				&tree_size, bst_map);
   if (node != NULL)
     {
       /* Calculate the unrolling factor based on the smallest type.  */
@@ -3922,7 +3932,7 @@ vect_build_slp_instance (vec_info *vinfo,
	 && compare_step_with_zero (vinfo, stmt_info) > 0
	 && vect_slp_prefer_store_lanes_p (vinfo, stmt_info,
					   group_size, 1));
-      if (want_store_lanes)
+      if (want_store_lanes || force_single_lane)
	i = 1;

       if (dump_enabled_p ())
@@ -3958,7 +3968,7 @@ vect_build_slp_instance (vec_info *vinfo,
				     (max_nunits, end - start));
	  rhs_nodes.safe_push (node);
	  start = end;
-	  if (want_store_lanes)
+	  if (want_store_lanes || force_single_lane)
	    end = start + 1;
	  else
	    end = group_size;
@@ -4086,7 +4096,8 @@ vect_analyze_slp_instance (vec_info *vinfo,
			    scalar_stmts_to_slp_tree_map_t *bst_map,
			    stmt_vec_info stmt_info,
			    slp_instance_kind kind,
-			    unsigned max_tree_size, unsigned *limit)
+			    unsigned max_tree_size, unsigned *limit,
+			    bool force_single_lane)
 {
   vec<stmt_vec_info> scalar_stmts;
@@ -4131,7 +4142,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
				  roots, remain, max_tree_size, limit, bst_map,
				  kind == slp_inst_kind_store
-				  ? stmt_info : NULL);
+				  ? stmt_info : NULL, force_single_lane);

   /* ??? If this is slp_inst_kind_store and the above succeeded here's
      where we should do store group splitting.  */
@@ -4662,6 +4673,94 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
	}
     }

+  /* Check whether we should force some SLP instances to use
+     load/store-lanes and do so by forcing SLP re-discovery with
+     single lanes.  We used to cancel SLP when this applied to all
+     instances in a loop but now we decide this per SLP instance.
+     It's important to do this only after SLP pattern recognition.  */
+  if (is_a <loop_vec_info> (vinfo))
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
+      if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
+	  && !SLP_INSTANCE_TREE (instance)->ldst_lanes)
+	{
+	  slp_tree slp_root = SLP_INSTANCE_TREE (instance);
+	  int group_size = SLP_TREE_LANES (slp_root);
+	  tree vectype = SLP_TREE_VECTYPE (slp_root);
+
+	  auto_vec<slp_tree> loads;
+	  hash_set<slp_tree> visited;
+	  vect_gather_slp_loads (loads, slp_root, visited);
+
+	  /* Check whether any load in the SLP instance is possibly
+	     permuted.  */
+	  bool loads_permuted = false;
+	  slp_tree load_node;
+	  unsigned j;
+	  FOR_EACH_VEC_ELT (loads, j, load_node)
+	    {
+	      if (!SLP_TREE_LOAD_PERMUTATION (load_node).exists ())
+		continue;
+	      unsigned k;
+	      stmt_vec_info load_info;
+	      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (load_node), k, load_info)
+		if (SLP_TREE_LOAD_PERMUTATION (load_node)[k] != k)
+		  {
+		    loads_permuted = true;
+		    break;
+		  }
+	    }
+
+	  /* If the loads and stores can use load/store-lanes force
+	     re-discovery with single lanes.  */
+	  if (loads_permuted
+	      && !slp_root->ldst_lanes
+	      && vect_store_lanes_supported (vectype, group_size, false)
+		   != IFN_LAST)
+	    {
+	      bool can_use_lanes = true;
+	      FOR_EACH_VEC_ELT (loads, j, load_node)
+		if (STMT_VINFO_GROUPED_ACCESS
+		      (SLP_TREE_REPRESENTATIVE (load_node)))
+		  {
+		    stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
+		      (SLP_TREE_REPRESENTATIVE (load_node));
+		    /* Use SLP for strided accesses (or if we can't
+		       load-lanes).  */
+		    if (STMT_VINFO_STRIDED_P (stmt_vinfo)
+			|| compare_step_with_zero (vinfo, stmt_vinfo) <= 0
+			|| vect_load_lanes_supported
+			     (STMT_VINFO_VECTYPE (stmt_vinfo),
+			      DR_GROUP_SIZE (stmt_vinfo), false) == IFN_LAST)
+		      {
+			can_use_lanes = false;
+			break;
+		      }
+		  }
+
+	      if (can_use_lanes)
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				     "SLP instance %p can use load/store-lanes,"
+				     " re-discovering with single-lanes\n",
+				     (void *) instance);
+
+		  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (slp_root);
+
+		  vect_free_slp_instance (instance);
+		  limit = max_tree_size;
+		  bool res = vect_analyze_slp_instance (vinfo, bst_map,
+							stmt_info,
+							slp_inst_kind_store,
+							max_tree_size, &limit,
+							true);
+		  gcc_assert (res);
+		  auto new_inst = LOOP_VINFO_SLP_INSTANCES (vinfo).pop ();
+		  LOOP_VINFO_SLP_INSTANCES (vinfo)[i] = new_inst;
+		}
+	    }
+	}
+
   /* When we end up with load permutations that we cannot possibly handle,
      like those requiring three vector inputs, lower them using interleaving
      like schemes.