From patchwork Tue Jun 11 09:04:04 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1946189
Date: Tue, 11 Jun 2024 11:04:04 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
cc: richard.sandiford@arm.com
Subject: [PATCH] Improve code generation of strided SLP loads
List-Id: Gcc-patches mailing list
Message-Id: <20240611090432.71E5A3858C50@sourceware.org>

This avoids falling back to elementwise accesses for strided SLP
loads when the group size is not a multiple of the vector element
size.  Instead we can use a smaller vector or integer type for the load.

For stores we can do the same, though restrictions on the stores we
handle and the fact that store-merging covers up for them make this
mostly effective for cost modeling.  This shows in
gcc.target/i386/vect-strided-3.c, which we now vectorize with V4SI
vectors rather than just V2SI ones.

For all of this there's still the opportunity to use non-uniform
accesses, say for a 6-element group with a VF of two do
V4SI, { V2SI, V2SI }, V4SI.  But that's for a possible followup.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  Textually this
depends on the gap improvement series, so I'll push only after those.

Target independent testing is difficult, and strided accesses are
difficult for VLA.  I suppose they should go through gather/scatter,
but we have to be able to construct the offset vector there.

Richard.

	* gcc.target/i386/vect-strided-1.c: New testcase.
	* gcc.target/i386/vect-strided-2.c: Likewise.
	* gcc.target/i386/vect-strided-3.c: Likewise.
	* gcc.target/i386/vect-strided-4.c: Likewise.
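The access-size choice described above can be sketched in plain C.  This is
only an illustration under assumptions, not the vectorizer code: the names
access_plan, gcd_u and plan_strided_access are hypothetical, the target
support checks the patch performs (for example whether the smaller vector or
integer type can actually be composed into or extracted from the full vector)
are ignored, and the one-element-per-access fallback is modeled here as a
simplification.

```c
#include <assert.h>

/* Hypothetical summary of how one vector is assembled; not a GCC type.  */
struct access_plan
{
  unsigned naccesses;   /* memory accesses needed per vector */
  unsigned lanes_each;  /* elements covered by each access */
};

/* Greatest common divisor by Euclid's algorithm.  */
static unsigned
gcd_u (unsigned a, unsigned b)
{
  while (b != 0)
    {
      unsigned t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Pick the widest access that evenly divides both the group size and
   the number of vector lanes, so that no access reads or writes
   elements outside the group.  */
struct access_plan
plan_strided_access (unsigned group_size, unsigned nunits)
{
  assert (group_size > 0 && nunits > 0);
  unsigned n = gcd_u (group_size, nunits);
  struct access_plan plan;
  if (n == nunits)
    {
      /* Group size is a multiple of the vector: one full-vector access.  */
      plan.naccesses = 1;
      plan.lanes_each = nunits;
    }
  else if (n > 1)
    {
      /* Use n-element pieces, nunits / n of them per vector.  */
      plan.naccesses = nunits / n;
      plan.lanes_each = n;
    }
  else
    {
      /* Assumed fallback: one element per access.  */
      plan.naccesses = nunits;
      plan.lanes_each = 1;
    }
  return plan;
}
```

For the six-element load group in vect-strided-1.c with four-lane V4SI
vectors, plan_strided_access (6, 4) gives two accesses of two lanes each,
which is consistent with the two-element loads that test expects.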
---
 .../gcc.target/i386/vect-strided-1.c          |  24 +++++
 .../gcc.target/i386/vect-strided-2.c          |  17 +++
 .../gcc.target/i386/vect-strided-3.c          |  20 ++++
 .../gcc.target/i386/vect-strided-4.c          |  20 ++++
 gcc/tree-vect-stmts.cc                        | 100 ++++++++----------
 5 files changed, 127 insertions(+), 54 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-4.c

diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-1.c b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
new file mode 100644
index 00000000000..db4a06711f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[8*i+0] = b[s*i+0];
+      a[8*i+1] = b[s*i+1];
+      a[8*i+2] = b[s*i+2];
+      a[8*i+3] = b[s*i+3];
+      a[8*i+4] = b[s*i+4];
+      a[8*i+5] = b[s*i+5];
+      a[8*i+6] = b[s*i+4];
+      a[8*i+7] = b[s*i+5];
+    }
+}
+
+/* Three two-element loads, two four-element stores.  On ia32 we elide
+   a permute and perform a redundant load.  */
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "movhps" 2 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movhps" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movups" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-2.c b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
new file mode 100644
index 00000000000..6fd64e28cf0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[4*i+0] = b[s*i+0];
+      a[4*i+1] = b[s*i+1];
+      a[4*i+2] = b[s*i+0];
+      a[4*i+3] = b[s*i+1];
+    }
+}
+
+/* One two-element load, one four-element store.  */
+/* { dg-final { scan-assembler-times "movq" 1 } } */
+/* { dg-final { scan-assembler-times "movups" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-3.c b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
new file mode 100644
index 00000000000..b462701a0b2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx -fno-tree-slp-vectorize" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  if (s >= 6)
+    for (int i = 0; i < 1024; ++i)
+      {
+	a[s*i+0] = b[4*i+0];
+	a[s*i+1] = b[4*i+1];
+	a[s*i+2] = b[4*i+2];
+	a[s*i+3] = b[4*i+3];
+	a[s*i+4] = b[4*i+0];
+	a[s*i+5] = b[4*i+1];
+      }
+}
+
+/* While the vectorizer generates 6 uint64 stores.  */
+/* { dg-final { scan-assembler-times "movq" 4 } } */
+/* { dg-final { scan-assembler-times "movhps" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-4.c b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
new file mode 100644
index 00000000000..dd922926a2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.2 -mno-avx -fno-tree-slp-vectorize" } */
+
+void foo (int * __restrict a, int * __restrict b, int *c, int s)
+{
+  if (s >= 2)
+    for (int i = 0; i < 1024; ++i)
+      {
+	a[s*i+0] = c[4*i+0];
+	a[s*i+1] = c[4*i+1];
+	b[s*i+0] = c[4*i+2];
+	b[s*i+1] = c[4*i+3];
+      }
+}
+
+/* Vectorization factor two, two two-element stores to a using movq
+   and two two-element stores to b via pextrq/movhps of the high part.  */
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "pextrq" 2 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movhps" 2 { target { ia32 } } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 8aa41833433..03a5db45976 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2036,15 +2036,10 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	first_dr_info
 	  = STMT_VINFO_DR_INFO (SLP_TREE_SCALAR_STMTS (slp_node)[0]);
       if (STMT_VINFO_STRIDED_P (first_stmt_info))
-	{
-	  /* Try to use consecutive accesses of DR_GROUP_SIZE elements,
-	     separated by the stride, until we have a complete vector.
-	     Fall back to scalar accesses if that isn't possible.  */
-	  if (multiple_p (nunits, group_size))
-	    *memory_access_type = VMAT_STRIDED_SLP;
-	  else
-	    *memory_access_type = VMAT_ELEMENTWISE;
-	}
+	/* Try to use consecutive accesses of as many elements as possible,
+	   separated by the stride, until we have a complete vector.
+	   Fall back to scalar accesses if that isn't possible.  */
+	*memory_access_type = VMAT_STRIDED_SLP;
       else
 	{
 	  int cmp = compare_step_with_zero (vinfo, stmt_info);
@@ -8512,12 +8507,29 @@ vectorizable_store (vec_info *vinfo,
       tree lvectype = vectype;
       if (slp)
 	{
-	  if (group_size < const_nunits
-	      && const_nunits % group_size == 0)
+	  HOST_WIDE_INT n = gcd (group_size, const_nunits);
+	  if (n == const_nunits)
 	    {
-	      nstores = const_nunits / group_size;
-	      lnel = group_size;
-	      ltype = build_vector_type (elem_type, group_size);
+	      int mis_align = dr_misalignment (first_dr_info, vectype);
+	      dr_alignment_support dr_align
+		= vect_supportable_dr_alignment (vinfo, dr_info, vectype,
+						 mis_align);
+	      if (dr_align == dr_aligned
+		  || dr_align == dr_unaligned_supported)
+		{
+		  nstores = 1;
+		  lnel = const_nunits;
+		  ltype = vectype;
+		  lvectype = vectype;
+		  alignment_support_scheme = dr_align;
+		  misalignment = mis_align;
+		}
+	    }
+	  else if (n > 1)
+	    {
+	      nstores = const_nunits / n;
+	      lnel = n;
+	      ltype = build_vector_type (elem_type, n);
 	      lvectype = vectype;
 
 	      /* First check if vec_extract optab doesn't support extraction
@@ -8526,7 +8538,7 @@ vectorizable_store (vec_info *vinfo,
 	      machine_mode vmode;
 	      if (!VECTOR_MODE_P (TYPE_MODE (vectype))
 		  || !related_vector_mode (TYPE_MODE (vectype), elmode,
-					   group_size).exists (&vmode)
+					   n).exists (&vmode)
 		  || (convert_optab_handler (vec_extract_optab,
 					     TYPE_MODE (vectype), vmode)
 		      == CODE_FOR_nothing))
@@ -8537,8 +8549,8 @@ vectorizable_store (vec_info *vinfo,
 		     re-interpreting it as the original vector type if
 		     supported.  */
 		  unsigned lsize
-		    = group_size * GET_MODE_BITSIZE (elmode);
-		  unsigned int lnunits = const_nunits / group_size;
+		    = n * GET_MODE_BITSIZE (elmode);
+		  unsigned int lnunits = const_nunits / n;
 		  /* If we can't construct such a vector fall back to
 		     element extracts from the original vector type and
 		     element size stores.  */
@@ -8551,7 +8563,7 @@ vectorizable_store (vec_info *vinfo,
 			  != CODE_FOR_nothing))
 		    {
 		      nstores = lnunits;
-		      lnel = group_size;
+		      lnel = n;
 		      ltype = build_nonstandard_integer_type (lsize, 1);
 		      lvectype = build_vector_type (ltype, nstores);
 		    }
@@ -8562,24 +8574,6 @@ vectorizable_store (vec_info *vinfo,
 		     issue exists here for reasonable archs.  */
 		}
 	    }
-	  else if (group_size >= const_nunits
-		   && group_size % const_nunits == 0)
-	    {
-	      int mis_align = dr_misalignment (first_dr_info, vectype);
-	      dr_alignment_support dr_align
-		= vect_supportable_dr_alignment (vinfo, dr_info, vectype,
-						 mis_align);
-	      if (dr_align == dr_aligned
-		  || dr_align == dr_unaligned_supported)
-		{
-		  nstores = 1;
-		  lnel = const_nunits;
-		  ltype = vectype;
-		  lvectype = vectype;
-		  alignment_support_scheme = dr_align;
-		  misalignment = mis_align;
-		}
-	    }
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
 	  ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
 	}
@@ -10364,34 +10358,32 @@ vectorizable_load (vec_info *vinfo,
       auto_vec dr_chain;
       if (memory_access_type == VMAT_STRIDED_SLP)
 	{
-	  if (group_size < const_nunits)
+	  HOST_WIDE_INT n = gcd (group_size, const_nunits);
+	  /* Use the target vector type if the group size is a multiple
+	     of it.  */
+	  if (n == const_nunits)
+	    {
+	      nloads = 1;
+	      lnel = const_nunits;
+	      ltype = vectype;
+	    }
+	  /* Else use the biggest vector we can load the group without
+	     accessing excess elements.  */
+	  else if (n > 1)
 	    {
-	      /* First check if vec_init optab supports construction from vector
-		 elts directly.  Otherwise avoid emitting a constructor of
-		 vector elements by performing the loads using an integer type
-		 of the same size, constructing a vector of those and then
-		 re-interpreting it as the original vector type.  This avoids a
-		 huge runtime penalty due to the general inability to perform
-		 store forwarding from smaller stores to a larger load.  */
 	      tree ptype;
 	      tree vtype
-		= vector_vector_composition_type (vectype,
-						  const_nunits / group_size,
+		= vector_vector_composition_type (vectype, const_nunits / n,
 						  &ptype);
 	      if (vtype != NULL_TREE)
 		{
-		  nloads = const_nunits / group_size;
-		  lnel = group_size;
+		  nloads = const_nunits / n;
+		  lnel = n;
 		  lvectype = vtype;
 		  ltype = ptype;
 		}
 	    }
-	  else
-	    {
-	      nloads = 1;
-	      lnel = const_nunits;
-	      ltype = vectype;
-	    }
+	  /* Else fall back to the default element-wise access.  */
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (TREE_TYPE (vectype)));
 	}
       /* Load vector(1) scalar_type if it's 1 element-wise vectype.  */