From patchwork Fri Oct 4 10:40:52 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 1992697
From: Richard Sandiford
To: rguenther@suse.de, tamar.christina@arm.com, gcc-patches@gcc.gnu.org
Cc: Richard Sandiford
Subject: [PATCH 2/4] vect: Restructure repeating_p case for SLP permutations
Date: Fri, 4 Oct 2024 11:40:52 +0100
Message-Id: <20241004104054.2653382-3-richard.sandiford@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20241004104054.2653382-1-richard.sandiford@arm.com>
References: <20241004104054.2653382-1-richard.sandiford@arm.com>

The repeating_p case previously handled the specific situation in
which the inputs have N lanes and the output has N lanes, where N
divides the number of vector elements.  In that case, every output
uses the same permute vector.

The code was therefore structured so that the outer loop only
constructed one permute vector, with an inner loop generating as many
VEC_PERM_EXPRs from it as required.

However, the main patch for PR116583 adds support for cycling through
N permute vectors, rather than just having one.  The current structure
doesn't really handle that case well.  (We'd need to interleave the
results after generating them, which sounds a bit fragile.)

This patch instead makes the transform phase calculate each output
vector's permutation explicitly, like for the !repeating_p path.
As a bonus, it gets rid of one use of SLP_TREE_NUMBER_OF_VEC_STMTS.

This arguably undermines one of the justifications for using
repeating_p for constant-length vectors: that the repeating_p path
involved less work than the !repeating_p path.  That justification
does still hold for the analysis phase, though, and that should be
the more time-sensitive part.  And the other justification -- to get
more coverage of the code -- still applies.  So I'd prefer that we
continue to use repeating_p for constant-length vectors unless that
causes a known missed optimisation.

gcc/
        PR tree-optimization/116583
        * tree-vect-slp.cc (vectorizable_slp_permutation_1): Remove the
        noutputs_per_mask inner loop and instead generate a separate
        permute vector for each output.
---
 gcc/tree-vect-slp.cc | 75 ++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 7aeda69f447..470128ea775 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -10243,26 +10243,33 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
       return 1;
     }
 
-  /* REPEATING_P is true if every output vector is guaranteed to use the
-     same permute vector.  We can handle that case for both variable-length
-     and constant-length vectors, but we only handle other cases for
-     constant-length vectors.
+  /* Set REPEATING_P to true if every output uses the same permute vector
+     and if we can generate the vectors in a vector-length agnostic way.
+
+     When REPEATING_P is true, NOUTPUTS holds the total number of outputs
+     that we actually need to generate.  */
+  uint64_t noutputs = 0;
+  loop_vec_info linfo = dyn_cast <loop_vec_info> (vinfo);
+  if (!linfo
+      || !constant_multiple_p (LOOP_VINFO_VECT_FACTOR (linfo)
+                               * SLP_TREE_LANES (node), nunits, &noutputs))
+    repeating_p = false;
+
+  /* We can handle the conditions described for REPEATING_P above for
+     both variable- and constant-length vectors.  The fallback requires
+     us to generate every element of every permute vector explicitly,
+     which is only possible for constant-length permute vectors.
 
      Set:
 
      - NPATTERNS and NELTS_PER_PATTERN to the encoding of the permute
-       mask vector that we want to build.
+       mask vectors that we want to build.
 
      - NCOPIES to the number of copies of PERM that we need in order
-       to build the necessary permute mask vectors.
-
-     - NOUTPUTS_PER_MASK to the number of output vectors we want to create
-       for each permute mask vector.  This is only relevant when GSI is
-       nonnull.  */
+       to build the necessary permute mask vectors.  */
   uint64_t npatterns;
   unsigned nelts_per_pattern;
   uint64_t ncopies;
-  unsigned noutputs_per_mask;
   if (repeating_p)
     {
       /* We need a single permute mask vector that has the form:
@@ -10274,7 +10281,6 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
          that we use for permutes requires 3n elements.  */
       npatterns = SLP_TREE_LANES (node);
       nelts_per_pattern = ncopies = 3;
-      noutputs_per_mask = SLP_TREE_NUMBER_OF_VEC_STMTS (node);
     }
   else
     {
@@ -10284,10 +10290,8 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
           || !TYPE_VECTOR_SUBPARTS (op_vectype).is_constant ())
         return -1;
       nelts_per_pattern = ncopies = 1;
-      if (loop_vec_info linfo = dyn_cast <loop_vec_info> (vinfo))
-        if (!LOOP_VINFO_VECT_FACTOR (linfo).is_constant (&ncopies))
-          return -1;
-      noutputs_per_mask = 1;
+      if (linfo && !LOOP_VINFO_VECT_FACTOR (linfo).is_constant (&ncopies))
+        return -1;
     }
   unsigned olanes = ncopies * SLP_TREE_LANES (node);
   gcc_assert (repeating_p || multiple_p (olanes, nunits));
@@ -10364,16 +10368,24 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
   mask.quick_grow (count);
   vec_perm_indices indices;
   unsigned nperms = 0;
-  for (unsigned i = 0; i < vperm.length (); ++i)
-    {
-      mask_element = vperm[i].second;
+  /* When REPEATING_P is true, we only have one unique permute vector
+     to check during analysis, but we need to generate NOUTPUTS vectors
+     during transformation.  */
+  unsigned total_nelts = olanes;
+  if (repeating_p && gsi)
+    total_nelts *= noutputs;
+  for (unsigned i = 0; i < total_nelts; ++i)
+    {
+      unsigned vi = i / olanes;
+      unsigned ei = i % olanes;
+      mask_element = vperm[ei].second;
       if (first_vec.first == -1U
-          || first_vec == vperm[i].first)
-        first_vec = vperm[i].first;
+          || first_vec == vperm[ei].first)
+        first_vec = vperm[ei].first;
       else if (second_vec.first == -1U
-               || second_vec == vperm[i].first)
+               || second_vec == vperm[ei].first)
         {
-          second_vec = vperm[i].first;
+          second_vec = vperm[ei].first;
           mask_element += nunits;
         }
       else
@@ -10437,17 +10449,12 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
               if (!identity_p)
                 mask_vec = vect_gen_perm_mask_checked (vectype, indices);
 
-              for (unsigned int vi = 0; vi < noutputs_per_mask; ++vi)
-                {
-                  tree first_def
-                    = vect_get_slp_vect_def (first_node,
-                                             first_vec.second + vi);
-                  tree second_def
-                    = vect_get_slp_vect_def (second_node,
-                                             second_vec.second + vi);
-                  vect_add_slp_permutation (vinfo, gsi, node, first_def,
-                                            second_def, mask_vec, mask[0]);
-                }
+              tree first_def
+                = vect_get_slp_vect_def (first_node, first_vec.second + vi);
+              tree second_def
+                = vect_get_slp_vect_def (second_node, second_vec.second + vi);
+              vect_add_slp_permutation (vinfo, gsi, node, first_def,
+                                        second_def, mask_vec, mask[0]);
             }
 
           index = 0;