From patchwork Wed Jul 3 13:23:42 2024
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1956238
Date: Wed, 3 Jul 2024 15:23:42 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/5] lower SLP load permutation to interleaving
Message-Id: <20240703132415.AD3A3384A48D@sourceware.org>

The following emulates classical interleaving for SLP load permutes
that we are unlikely to handle natively.  This is to handle cases where
interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[4*i + 0] = b[4*i + 0] * 3;
      a[4*i + 1] = b[4*i + 1] + 3;
      a[4*i + 2] = (b[4*i + 2] * 3 + 3);
      a[4*i + 3] = b[4*i + 3] * 3;
    }
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.
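To make the intended lowering concrete: with V4SImode vectors, lane
zero of the group is { b[0], b[4], b[8], b[12] }, which cannot be
produced by a single two-input permute of the loaded vectors, but
falls out of log2(4) == 2 even/odd extraction steps, each of which is
a two-input permute.  A minimal standalone sketch of that
decomposition, assuming GNU C vector extensions and
__builtin_shufflevector (GCC 12+); the function is purely
illustrative and not the SLP-graph transform the patch performs:

typedef int v4si __attribute__((vector_size (16)));

void
extract_lane0 (int * __restrict a, int *b)
{
  /* Four contiguous, unpermuted V4SImode loads covering four
     groups of four lanes each.  */
  v4si v0 = *(v4si *)&b[0];
  v4si v1 = *(v4si *)&b[4];
  v4si v2 = *(v4si *)&b[8];
  v4si v3 = *(v4si *)&b[12];
  /* First even/odd step: extract the even lanes, halving the
     group size from four to two.  */
  v4si e0 = __builtin_shufflevector (v0, v1, 0, 2, 4, 6);
  v4si e1 = __builtin_shufflevector (v2, v3, 0, 2, 4, 6);
  /* Second step: the even lanes of the even lanes are
     { b[0], b[4], b[8], b[12] }, i.e. lane zero of each group.
     The other lanes fall out of the analogous odd extracts.  */
  *(v4si *)a = __builtin_shufflevector (e0, e1, 0, 2, 4, 6);
}

Every permute above has only two vector inputs, which is what the
lowering below aims for.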
The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition and permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.

What can be handled is quite restrictive, matching only a subset of
the non-SLP interleaving cases: group sizes that are a power of two,
and only groups without gaps.  The non-SLP interleaving code can in
addition handle group sizes of three and five, but I am not sure
whether that is possible in a VL-agnostic way.  It should still be
possible to set up the SLP graph in a way that load-lanes can be
matched by SLP pattern recognition.

As said, gaps are currently not handled - for SLP we have a
representational issue in that SLP_TREE_SCALAR_STMTS for "gap lanes"
would need to be filled in some way (even if we just push NULL).

The patch misses multi-level even/odd handling as well as CSEing the
intermediate generated permutes.  Both are quite straightforward to
add, but eventually there is a better or more general strategy for
lowering?  The main goal of the patch is to avoid falling back to
non-SLP for cases the interleaving code handles.
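For illustration, the even/odd classification at the heart of the
analysis can be written as a small bitmask computation over the load
permutation.  A hypothetical standalone sketch in plain C (all names
here are illustrative, not taken from the patch):

#include <stdio.h>

int
main (void)
{
  /* A load permutation selecting lanes 1 and 5 of a group of 8.  */
  unsigned perm[] = { 1, 5 };
  unsigned group_lanes = 8;
  /* Start with all bits a lane number can have set; for a
     power-of-two group size this is (1 << ceil_log2 (8)) - 1.  */
  unsigned even = group_lanes - 1, odd = even;
  for (unsigned i = 0; i < sizeof perm / sizeof *perm; ++i)
    {
      even &= ~perm[i];   /* Bits clear in every selected lane.  */
      odd &= perm[i];     /* Bits set in every selected lane.  */
    }
  /* A set bit at position p in even (odd) means all selected lanes
     lie in the even (odd) half of an even/odd decomposition with
     N == 1 << p consecutive elements.  Here even is 0x2 and odd is
     0x1: lanes { 1, 5 } are covered both by the even half
     { 0, 1, 4, 5 } with N == 2 and by the odd half { 1, 3, 5, 7 }
     with N == 1.  */
  printf ("even mask %#x, odd mask %#x\n", even, odd);
  return 0;
}

When both masks end up zero the permutation does not fit any even/odd
scheme and the load is left alone; otherwise one extraction halves
the group size and the lowered load reads from the extracted half.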
	* tree-vect-slp.cc (vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.
---
 gcc/tree-vect-slp.cc | 287 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 287 insertions(+)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index b88bba44760..1d5c4d99549 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3889,6 +3889,287 @@ vect_analyze_slp_instance (vec_info *vinfo,
   return res;
 }
 
+/* qsort comparator ordering SLP load nodes.  */
+
+static int
+vllp_cmp (const void *a_, const void *b_)
+{
+  const slp_tree a = *(const slp_tree *)a_;
+  const slp_tree b = *(const slp_tree *)b_;
+  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
+  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
+  if (STMT_VINFO_GROUPED_ACCESS (a0)
+      && STMT_VINFO_GROUPED_ACCESS (b0)
+      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+    {
+      /* Same group, order after lanes used.  */
+      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
+        return 1;
+      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
+        return -1;
+      else
+        {
+          /* Try to order loads using the same lanes together, breaking
+             the tie with the lane number that first differs.  */
+          if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+              && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+            return 0;
+          else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
+                   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+            return 1;
+          else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+                   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
+            return -1;
+          else
+            {
+              for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
+                if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+                    != SLP_TREE_LOAD_PERMUTATION (b)[i])
+                  {
+                    /* In-order lane first, that's what the above case for
+                       no permutation does.  */
+                    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
+                      return -1;
+                    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
+                      return 1;
+                    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+                             < SLP_TREE_LOAD_PERMUTATION (b)[i])
+                      return -1;
+                    else
+                      return 1;
+                  }
+              return 0;
+            }
+        }
+    }
+  else /* Different groups or non-groups.  */
+    {
+      /* Order groups as their first element to keep them together.  */
+      if (STMT_VINFO_GROUPED_ACCESS (a0))
+        a0 = DR_GROUP_FIRST_ELEMENT (a0);
+      if (STMT_VINFO_GROUPED_ACCESS (b0))
+        b0 = DR_GROUP_FIRST_ELEMENT (b0);
+      if (a0 == b0)
+        return 0;
+      /* Tie using UID.  */
+      else if (gimple_uid (STMT_VINFO_STMT (a0))
+               < gimple_uid (STMT_VINFO_STMT (b0)))
+        return -1;
+      else
+        {
+          gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
+                      != gimple_uid (STMT_VINFO_STMT (b0)));
+          return 1;
+        }
+    }
+}
+
+/* Process the set of LOADS that are all from the same dataref group.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+                              scalar_stmts_to_slp_tree_map_t *bst_map,
+                              const array_slice<slp_tree> &loads)
+{
+  /* We at this point want to lower without a fixed VF or vector
+     size in mind which means we cannot actually compute whether we
+     need three or more vectors for a load permutation yet.  So always
+     lower.  */
+  stmt_vec_info first
+    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
+
+  /* ??? In principle we have to consider a gap up to the next full
+     vector, but we have to actually represent a scalar stmt for the
+     gaps value so delay handling this.  The same is true for
+     in-between gaps which the load places in the load-permutation
+     represent.  It's probably not worth trying an intermediate packing
+     to vectors without gap even if that might handle some more cases.
+     Instead get the gap case correct in some way.  */
+  unsigned group_lanes = 0;
+  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+    {
+      if ((s == first && DR_GROUP_GAP (s) != 0)
+          || (s != first && DR_GROUP_GAP (s) != 1))
+        return;
+      group_lanes++;
+    }
+  /* Only a power-of-two number of lanes matches interleaving.  */
+  if (exact_log2 (group_lanes) == -1)
+    return;
+
+  for (slp_tree load : loads)
+    {
+      /* Leave masked or gather loads alone for now.  */
+      if (!SLP_TREE_CHILDREN (load).is_empty ())
+        continue;
+
+      /* We need to lower only loads of less than half of the group's
+         lanes, including duplicate lanes.  */
+      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+        continue;
+
+      /* Lower by reducing the group to half its size using an
+         interleaving scheme.  For this try to compute whether all
+         elements needed for this load are in even or odd elements of
+         an even/odd decomposition with N consecutive elements.
+         Thus { e, e, o, o, e, e, o, o } would be an even/odd
+         decomposition with N == 2.  */
+      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
+      unsigned odd = even;
+      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
+        {
+          even &= ~l;
+          odd &= l;
+        }
+      /* Give up when this doesn't match up with an interleaving
+         scheme.  */
+      if (!even && !odd)
+        continue;
+
+      /* First build (and possibly re-use) a load node for the
+         unpermuted group.  */
+      vec<stmt_vec_info> stmts;
+      stmts.create (group_lanes);
+      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+        stmts.quick_push (s);
+      poly_uint64 max_nunits;
+      bool *matches = XALLOCAVEC (bool, group_lanes);
+      unsigned limit = 1;
+      unsigned tree_size = 0;
+      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
+                                         group_lanes,
+                                         &max_nunits, matches, &limit,
+                                         &tree_size, bst_map);
+
+      /* Build the permute to get the original load permutation
+         order.  */
+      lane_permutation_t final_perm;
+      final_perm.create (SLP_TREE_LANES (load));
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+        final_perm.quick_push
+          (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
+
+      /* Now build an even or odd extraction from the unpermuted
+         load.  */
+      lane_permutation_t perm;
+      perm.create (group_lanes / 2);
+      unsigned level;
+      if (even
+          && ((level = 1 << ctz_hwi (even)), true)
+          && group_lanes % (2 * level) == 0)
+        {
+          /* { 0, 1, ... 4, 5 ..., }  */
+          unsigned level = 1 << ctz_hwi (even);
+          for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+            for (unsigned j = 0; j < level; ++j)
+              perm.quick_push (std::make_pair (0, 2 * i * level + j));
+        }
+      else if (odd)
+        {
+          /* { ..., 2, 3, ... 6, 7 }  */
+          unsigned level = 1 << ctz_hwi (odd);
+          gcc_assert (group_lanes % (2 * level) == 0);
+          for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+            for (unsigned j = 0; j < level; ++j)
+              perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
+        }
+      else
+        gcc_unreachable ();
+
+      /* And update final_perm.  */
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+        {
+          unsigned l = final_perm[i].second;
+          unsigned j;
+          for (j = 0; j < perm.length (); ++j)
+            if (perm[j].second == l)
+              {
+                final_perm[i].second = j;
+                break;
+              }
+          gcc_assert (j < perm.length ());
+        }
+
+      /* And create scalar stmts.  */
+      vec<stmt_vec_info> perm_stmts;
+      perm_stmts.create (perm.length ());
+      for (unsigned i = 0; i < perm.length (); ++i)
+        perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]);
+
+      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
+      SLP_TREE_CHILDREN (p).quick_push (l0);
+      SLP_TREE_LANE_PERMUTATION (p) = perm;
+      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
+      SLP_TREE_LANES (p) = perm.length ();
+      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
+      /* ??? As we have scalar stmts for this intermediate permute we
+         could CSE it via bst_map but we do not want to pick up
+         another SLP node with a load permutation.  We instead should
+         have a "local" CSE map here.  */
+      SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
+      /* ??? We need to iterate if group_lanes / 2 is still too large.  */
+      /* ??? Ideally pick the best even/odd scheme usable for
+         most of the loads.  -> do a multi-step scheme?  */
+
+      /* And finally from the ordered reduction node create the
+         permute to shuffle the lanes into the original load-permutation
+         order.  We replace the original load node with this.  */
+      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
+      SLP_TREE_LOAD_PERMUTATION (load).release ();
+      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
+      SLP_TREE_CHILDREN (load).create (1);
+      SLP_TREE_CHILDREN (load).quick_push (p);
+    }
+}
+
+/* Transform SLP loads in the SLP graph created by SLP discovery to
+   group loads from the same group and lower load permutations that
+   are unlikely to be supported into a series of permutes.
+   In the degenerate case of having only single-lane SLP instances
+   this should result in a series of permute nodes emulating an
+   interleaving scheme.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+                              scalar_stmts_to_slp_tree_map_t *bst_map)
+{
+  /* Gather and sort loads across all instances.  */
+  hash_set<slp_tree> visited;
+  auto_vec<slp_tree> loads;
+  for (auto inst : loop_vinfo->slp_instances)
+    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
+  if (loads.is_empty ())
+    return;
+  loads.qsort (vllp_cmp);
+
+  /* Now process each dataref group separately.  */
+  unsigned firsti = 0;
+  for (unsigned i = 1; i < loads.length (); ++i)
+    {
+      slp_tree first = loads[firsti];
+      slp_tree next = loads[i];
+      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
+      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
+      if (STMT_VINFO_GROUPED_ACCESS (a0)
+          && STMT_VINFO_GROUPED_ACCESS (b0)
+          && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+        continue;
+      /* Just one SLP load of a possible group, leave those alone.  */
+      if (i == firsti + 1)
+        {
+          firsti = i;
+          continue;
+        }
+      /* Now we have multiple SLP loads of the same group from
+         firsti to i - 1.  */
+      vect_lower_load_permutations (loop_vinfo, bst_map,
+                                    make_array_slice (&loads[firsti],
+                                                      i - firsti));
+      firsti = i;
+    }
+  if (firsti < loads.length () - 1)
+    vect_lower_load_permutations (loop_vinfo, bst_map,
+                                  make_array_slice (&loads[firsti],
+                                                    loads.length () - firsti));
+}
+
 /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
    trees of packed scalar stmts if SLP is possible.  */
 
@@ -4030,6 +4311,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	}
     }
 
+  /* When we end up with load permutations that we cannot possibly handle,
+     like those requiring three vector inputs, lower them using
+     interleaving-like schemes.  */
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    vect_lower_load_permutations (loop_vinfo, bst_map);
+
   hash_set<slp_tree> visited_patterns;
   slp_tree_to_load_perm_map_t perm_cache;
   slp_compat_nodes_map_t compat_cache;