From patchwork Wed Jun 12 09:16:03 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1946758
Date: Wed, 12 Jun 2024 11:16:03 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 3/3][v3] Improve code generation of strided SLP loads
List-Id: Gcc-patches mailing list
Message-Id: <20240612091715.22DAE385DDF1@sourceware.org>

This avoids falling back to elementwise accesses for strided SLP
loads when the group size and the number of vector elements do not
evenly divide one another.  Instead we can use a smaller vector or
integer type for the load, sized by the GCD of the two.

For stores we can do the same, though the restrictions on the stores
we handle and the fact that store-merging covers up for this mean the
change is mostly effective for cost modeling.  This shows for
gcc.target/i386/vect-strided-3.c, which we now vectorize with V4SI
vectors rather than just V2SI ones.

For all of this there is still the opportunity to use non-uniform
accesses, say for a 6-element group with a VF of two do
V4SI, { V2SI, V2SI }, V4SI.  But that is left for a possible followup.

        * gcc.target/i386/vect-strided-1.c: New testcase.
        * gcc.target/i386/vect-strided-2.c: Likewise.
        * gcc.target/i386/vect-strided-3.c: Likewise.
        * gcc.target/i386/vect-strided-4.c: Likewise.
---
 .../gcc.target/i386/vect-strided-1.c          |  24 +++++
 .../gcc.target/i386/vect-strided-2.c          |  17 +++
 .../gcc.target/i386/vect-strided-3.c          |  20 ++++
 .../gcc.target/i386/vect-strided-4.c          |  20 ++++
 gcc/tree-vect-stmts.cc                        | 100 ++++++++----------
 5 files changed, 127 insertions(+), 54 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-4.c

diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-1.c b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
new file mode 100644
index 00000000000..db4a06711f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[8*i+0] = b[s*i+0];
+      a[8*i+1] = b[s*i+1];
+      a[8*i+2] = b[s*i+2];
+      a[8*i+3] = b[s*i+3];
+      a[8*i+4] = b[s*i+4];
+      a[8*i+5] = b[s*i+5];
+      a[8*i+6] = b[s*i+4];
+      a[8*i+7] = b[s*i+5];
+    }
+}
+
+/* Three two-element loads, two four-element stores.  On ia32 we elide
+   a permute and perform a redundant load.  */
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "movhps" 2 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movhps" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movups" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-2.c b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
new file mode 100644
index 00000000000..6fd64e28cf0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[4*i+0] = b[s*i+0];
+      a[4*i+1] = b[s*i+1];
+      a[4*i+2] = b[s*i+0];
+      a[4*i+3] = b[s*i+1];
+    }
+}
+
+/* One two-element load, one four-element store.  */
+/* { dg-final { scan-assembler-times "movq" 1 } } */
+/* { dg-final { scan-assembler-times "movups" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-3.c b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
new file mode 100644
index 00000000000..b462701a0b2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx -fno-tree-slp-vectorize" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  if (s >= 6)
+    for (int i = 0; i < 1024; ++i)
+      {
+        a[s*i+0] = b[4*i+0];
+        a[s*i+1] = b[4*i+1];
+        a[s*i+2] = b[4*i+2];
+        a[s*i+3] = b[4*i+3];
+        a[s*i+4] = b[4*i+0];
+        a[s*i+5] = b[4*i+1];
+      }
+}
+
+/* While the vectorizer generates 6 uint64 stores.  */
+/* { dg-final { scan-assembler-times "movq" 4 } } */
+/* { dg-final { scan-assembler-times "movhps" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-4.c b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
new file mode 100644
index 00000000000..dd922926a2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.2 -mno-avx -fno-tree-slp-vectorize" } */
+
+void foo (int * __restrict a, int * __restrict b, int *c, int s)
+{
+  if (s >= 2)
+    for (int i = 0; i < 1024; ++i)
+      {
+        a[s*i+0] = c[4*i+0];
+        a[s*i+1] = c[4*i+1];
+        b[s*i+0] = c[4*i+2];
+        b[s*i+1] = c[4*i+3];
+      }
+}
+
+/* Vectorization factor two, two two-element stores to a using movq
+   and two two-element stores to b via pextrq/movhps of the high part.  */
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "pextrq" 2 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movhps" 2 { target { ia32 } } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 701a44e44cd..d148e11a514 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2036,15 +2036,10 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
         first_dr_info
           = STMT_VINFO_DR_INFO (SLP_TREE_SCALAR_STMTS (slp_node)[0]);
       if (STMT_VINFO_STRIDED_P (first_stmt_info))
-        {
-          /* Try to use consecutive accesses of DR_GROUP_SIZE elements,
-             separated by the stride, until we have a complete vector.
-             Fall back to scalar accesses if that isn't possible.  */
-          if (multiple_p (nunits, group_size))
-            *memory_access_type = VMAT_STRIDED_SLP;
-          else
-            *memory_access_type = VMAT_ELEMENTWISE;
-        }
+        /* Try to use consecutive accesses of as many elements as possible,
+           separated by the stride, until we have a complete vector.
+           Fall back to scalar accesses if that isn't possible.  */
+        *memory_access_type = VMAT_STRIDED_SLP;
       else
         {
           int cmp = compare_step_with_zero (vinfo, stmt_info);
@@ -8514,12 +8509,29 @@ vectorizable_store (vec_info *vinfo,
       tree lvectype = vectype;
       if (slp)
         {
-          if (group_size < const_nunits
-              && const_nunits % group_size == 0)
+          HOST_WIDE_INT n = gcd (group_size, const_nunits);
+          if (n == const_nunits)
             {
-              nstores = const_nunits / group_size;
-              lnel = group_size;
-              ltype = build_vector_type (elem_type, group_size);
+              int mis_align = dr_misalignment (first_dr_info, vectype);
+              dr_alignment_support dr_align
+                = vect_supportable_dr_alignment (vinfo, dr_info, vectype,
+                                                 mis_align);
+              if (dr_align == dr_aligned
+                  || dr_align == dr_unaligned_supported)
+                {
+                  nstores = 1;
+                  lnel = const_nunits;
+                  ltype = vectype;
+                  lvectype = vectype;
+                  alignment_support_scheme = dr_align;
+                  misalignment = mis_align;
+                }
+            }
+          else if (n > 1)
+            {
+              nstores = const_nunits / n;
+              lnel = n;
+              ltype = build_vector_type (elem_type, n);
               lvectype = vectype;
 
               /* First check if vec_extract optab doesn't support extraction
@@ -8528,7 +8540,7 @@ vectorizable_store (vec_info *vinfo,
               machine_mode vmode;
               if (!VECTOR_MODE_P (TYPE_MODE (vectype))
                   || !related_vector_mode (TYPE_MODE (vectype), elmode,
-                                           group_size).exists (&vmode)
+                                           n).exists (&vmode)
                   || (convert_optab_handler (vec_extract_optab,
                                              TYPE_MODE (vectype), vmode)
                       == CODE_FOR_nothing))
@@ -8539,8 +8551,8 @@ vectorizable_store (vec_info *vinfo,
                      re-interpreting it as the original vector type if
                      supported.  */
                   unsigned lsize
-                    = group_size * GET_MODE_BITSIZE (elmode);
-                  unsigned int lnunits = const_nunits / group_size;
+                    = n * GET_MODE_BITSIZE (elmode);
+                  unsigned int lnunits = const_nunits / n;
                   /* If we can't construct such a vector fall back to
                      element extracts from the original vector type and
                      element size stores.  */
@@ -8553,7 +8565,7 @@ vectorizable_store (vec_info *vinfo,
                           != CODE_FOR_nothing))
                     {
                       nstores = lnunits;
-                      lnel = group_size;
+                      lnel = n;
                       ltype = build_nonstandard_integer_type (lsize, 1);
                       lvectype = build_vector_type (ltype, nstores);
                     }
@@ -8564,24 +8576,6 @@ vectorizable_store (vec_info *vinfo,
                      issue exists here for reasonable archs.  */
                 }
             }
-          else if (group_size >= const_nunits
-                   && group_size % const_nunits == 0)
-            {
-              int mis_align = dr_misalignment (first_dr_info, vectype);
-              dr_alignment_support dr_align
-                = vect_supportable_dr_alignment (vinfo, dr_info, vectype,
-                                                 mis_align);
-              if (dr_align == dr_aligned
-                  || dr_align == dr_unaligned_supported)
-                {
-                  nstores = 1;
-                  lnel = const_nunits;
-                  ltype = vectype;
-                  lvectype = vectype;
-                  alignment_support_scheme = dr_align;
-                  misalignment = mis_align;
-                }
-            }
           ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
           ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
         }
@@ -10366,34 +10360,32 @@ vectorizable_load (vec_info *vinfo,
       auto_vec<tree> dr_chain;
       if (memory_access_type == VMAT_STRIDED_SLP)
         {
-          if (group_size < const_nunits)
+          HOST_WIDE_INT n = gcd (group_size, const_nunits);
+          /* Use the target vector type if the group size is a multiple
+             of it.  */
+          if (n == const_nunits)
+            {
+              nloads = 1;
+              lnel = const_nunits;
+              ltype = vectype;
+            }
+          /* Else use the biggest vector we can load the group without
+             accessing excess elements.  */
+          else if (n > 1)
             {
-              /* First check if vec_init optab supports construction from vector
-                 elts directly.  Otherwise avoid emitting a constructor of
-                 vector elements by performing the loads using an integer type
-                 of the same size, constructing a vector of those and then
-                 re-interpreting it as the original vector type.  This avoids a
-                 huge runtime penalty due to the general inability to perform
-                 store forwarding from smaller stores to a larger load.  */
               tree ptype;
               tree vtype
-                = vector_vector_composition_type (vectype,
-                                                  const_nunits / group_size,
+                = vector_vector_composition_type (vectype, const_nunits / n,
                                                   &ptype);
               if (vtype != NULL_TREE)
                 {
-                  nloads = const_nunits / group_size;
-                  lnel = group_size;
+                  nloads = const_nunits / n;
+                  lnel = n;
                   lvectype = vtype;
                   ltype = ptype;
                 }
             }
-          else
-            {
-              nloads = 1;
-              lnel = const_nunits;
-              ltype = vectype;
-            }
+          /* Else fall back to the default element-wise access.  */
           ltype = build_aligned_type (ltype, TYPE_ALIGN (TREE_TYPE (vectype)));
         }
       /* Load vector(1) scalar_type if it's 1 element-wise vectype.  */