From patchwork Tue Jun 11 09:02:01 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1946188
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Date: Tue, 11 Jun 2024 11:02:01 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
cc: richard.sandiford@arm.com
Subject: [PATCH] tree-optimization/115385 - handle more gaps with peeling of
 a single iteration
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 MISSING_MID, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE
 autolearn=ham
 autolearn_force=no version=3.4.6
X-BeenThere: gcc-patches@gcc.gnu.org
Precedence: list
List-Id: Gcc-patches mailing list
Message-Id: <20240611090227.C045C385DDDB@sourceware.org>

The following makes peeling of a single scalar iteration handle more
gaps, including non-power-of-two cases.  This can be done by rounding
up the remaining access to the next power-of-two, which ensures that
the next scalar iteration will pick up at least the number of excess
elements we access.

I've added a correctness testcase and an x86-specific testcase scanning
for the optimization.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  I plan to
push this tomorrow after eyeing the CI.

Checking SPEC CPU2017 on x86 shows we have no case left (from 33 before)
where peeling for gaps is insufficient.  This of course relies
on sufficient vec_init optab coverage.

Richard.

	PR tree-optimization/115385
	* tree-vect-stmts.cc (get_group_load_store_type): Peeling
	of a single scalar iteration is sufficient if we can narrow
	the access to the next power of two of the bits in the last
	access.
	(vectorizable_load): Ensure that the last access is narrowed.

	* gcc.dg/vect/pr115385.c: New testcase.
	* gcc.target/i386/vect-pr115385.c: Likewise.
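As a reader's aid, the rounding described above can be worked through for
the two loop shapes in the new testcases.  This is only a sketch: the helper
names are hypothetical and not part of GCC, and it assumes 16 vector lanes
(V16QI), a vectorization factor of 1 and no gap, which the patch itself does
not state.

```c
#include <assert.h>

/* Smallest k with 2^k >= x.  A local stand-in for GCC's ceil_log2;
   this helper is an assumption for illustration, not the GCC function.  */
static unsigned
ceil_log2_u (unsigned long x)
{
  assert (x > 0);
  unsigned k = 0;
  while ((1UL << k) < x)
    ++k;
  return k;
}

/* Elements still needed by the last vector access, mirroring the
   expression (group_size * cvf - gap) % cnunits used in the patch.  */
static unsigned long
remaining_elems (unsigned long group_size, unsigned long cvf,
		 unsigned long gap, unsigned long cnunits)
{
  return (group_size * cvf - gap) % cnunits;
}

/* Size of the narrowed last access: CREMAIN rounded up to the next
   power of two.  Requires CREMAIN > 0.  */
static unsigned long
part_size (unsigned long cremain)
{
  return 1UL << ceil_log2_u (cremain);
}

/* Elements the narrowed access reads beyond the ones actually needed.  */
static unsigned long
excess_elems (unsigned long cremain)
{
  return part_size (cremain) - cremain;
}
```

Under those assumptions the group of 3 in foo leaves 3 remaining elements,
a partial access of 4 and 1 excess element; the group of 5 in bar leaves 5,
a partial access of 8 and 3 excess elements.  In both cases the excess is no
larger than the group a single peeled scalar iteration reads.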
---
 gcc/testsuite/gcc.dg/vect/pr115385.c          | 88 +++++++++++++++++++
 gcc/testsuite/gcc.target/i386/vect-pr115385.c | 53 +++++++++++
 gcc/tree-vect-stmts.cc                        | 34 ++++++-
 3 files changed, 173 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr115385.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-pr115385.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr115385.c b/gcc/testsuite/gcc.dg/vect/pr115385.c
new file mode 100644
index 00000000000..a18cd665d7d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115385.c
@@ -0,0 +1,88 @@
+/* { dg-require-effective-target mmap } */
+
+#include <sys/mman.h>
+#include <stdio.h>
+
+#define COUNT 511
+#define MMAP_SIZE 0x20000
+#define ADDRESS 0x1122000000
+#define TYPE unsigned char
+
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+
+void __attribute__((noipa)) foo(TYPE * __restrict x,
+                                TYPE *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[3*i+0];
+      x[16*i+1] = y[3*i+1];
+      x[16*i+2] = y[3*i+2];
+      x[16*i+3] = y[3*i+0];
+      x[16*i+4] = y[3*i+1];
+      x[16*i+5] = y[3*i+2];
+      x[16*i+6] = y[3*i+0];
+      x[16*i+7] = y[3*i+1];
+      x[16*i+8] = y[3*i+2];
+      x[16*i+9] = y[3*i+0];
+      x[16*i+10] = y[3*i+1];
+      x[16*i+11] = y[3*i+2];
+      x[16*i+12] = y[3*i+0];
+      x[16*i+13] = y[3*i+1];
+      x[16*i+14] = y[3*i+2];
+      x[16*i+15] = y[3*i+0];
+    }
+}
+
+void __attribute__((noipa)) bar(TYPE * __restrict x,
+                                TYPE *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[5*i+0];
+      x[16*i+1] = y[5*i+1];
+      x[16*i+2] = y[5*i+2];
+      x[16*i+3] = y[5*i+3];
+      x[16*i+4] = y[5*i+4];
+      x[16*i+5] = y[5*i+0];
+      x[16*i+6] = y[5*i+1];
+      x[16*i+7] = y[5*i+2];
+      x[16*i+8] = y[5*i+3];
+      x[16*i+9] = y[5*i+4];
+      x[16*i+10] = y[5*i+0];
+      x[16*i+11] = y[5*i+1];
+      x[16*i+12] = y[5*i+2];
+      x[16*i+13] = y[5*i+3];
+      x[16*i+14] = y[5*i+4];
+      x[16*i+15] = y[5*i+0];
+    }
+}
+
+TYPE x[COUNT * 16];
+
+int
+main (void)
+{
+  void *y;
+  TYPE *end_y;
+
+  y = mmap ((void *) ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
+            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+  if (y == MAP_FAILED)
+    {
+      perror ("mmap");
+      return 1;
+    }
+
+  end_y = (TYPE *) ((char *) y + MMAP_SIZE);
+
+  foo (x, end_y - COUNT * 3, COUNT);
+  bar (x, end_y - COUNT * 5, COUNT);
+
+  return 0;
+}
+
+/* We always require a scalar epilogue here but we don't know which
+   targets support vector composition this way.  */
diff --git a/gcc/testsuite/gcc.target/i386/vect-pr115385.c b/gcc/testsuite/gcc.target/i386/vect-pr115385.c
new file mode 100644
index 00000000000..a6be9ce4e54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-pr115385.c
@@ -0,0 +1,53 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -msse4.1 -mno-avx -fdump-tree-vect-details" } */
+
+void __attribute__((noipa)) foo(unsigned char * __restrict x,
+                                unsigned char *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[3*i+0];
+      x[16*i+1] = y[3*i+1];
+      x[16*i+2] = y[3*i+2];
+      x[16*i+3] = y[3*i+0];
+      x[16*i+4] = y[3*i+1];
+      x[16*i+5] = y[3*i+2];
+      x[16*i+6] = y[3*i+0];
+      x[16*i+7] = y[3*i+1];
+      x[16*i+8] = y[3*i+2];
+      x[16*i+9] = y[3*i+0];
+      x[16*i+10] = y[3*i+1];
+      x[16*i+11] = y[3*i+2];
+      x[16*i+12] = y[3*i+0];
+      x[16*i+13] = y[3*i+1];
+      x[16*i+14] = y[3*i+2];
+      x[16*i+15] = y[3*i+0];
+    }
+}
+
+void __attribute__((noipa)) bar(unsigned char * __restrict x,
+                                unsigned char *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[5*i+0];
+      x[16*i+1] = y[5*i+1];
+      x[16*i+2] = y[5*i+2];
+      x[16*i+3] = y[5*i+3];
+      x[16*i+4] = y[5*i+4];
+      x[16*i+5] = y[5*i+0];
+      x[16*i+6] = y[5*i+1];
+      x[16*i+7] = y[5*i+2];
+      x[16*i+8] = y[5*i+3];
+      x[16*i+9] = y[5*i+4];
+      x[16*i+10] = y[5*i+0];
+      x[16*i+11] = y[5*i+1];
+      x[16*i+12] = y[5*i+2];
+      x[16*i+13] = y[5*i+3];
+      x[16*i+14] = y[5*i+4];
+      x[16*i+15] = y[5*i+0];
+    }
+}
+
+/* { dg-final { scan-tree-dump "Data access with gaps requires scalar epilogue loop" "vect"} } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect"} } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 54106a2be37..8aa41833433 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2142,7 +2142,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                              "Peeling for outer loop is not supported\n");
           return false;
         }
-      unsigned HOST_WIDE_INT cnunits, cvf;
+      unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
       if (overrun_p
           && (!nunits.is_constant (&cnunits)
               || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
@@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                  access excess elements.
                  ???  Enhancements include peeling multiple iterations
                  or using masked loads with a static mask.  */
-              || (group_size * cvf) % cnunits + group_size - gap < cnunits))
+              || ((group_size * cvf) % cnunits + group_size - gap < cnunits
+                  /* But peeling a single scalar iteration is enough if
+                     we can use the next power-of-two sized partial
+                     access.  */
+                  && ((cremain = (group_size * cvf - gap) % cnunits), true
+                      && ((cpart_size = (1 << ceil_log2 (cremain)))
+                          != cnunits)
+                      && vector_vector_composition_type
+                           (vectype, cnunits / cpart_size,
+                            &half_vtype) == NULL_TREE))))
         {
           if (dump_enabled_p ())
             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
                               gcc_assert (new_vtype
                                           || LOOP_VINFO_PEELING_FOR_GAPS
                                                (loop_vinfo));
+                              /* But still reduce the access size to the next
+                                 required power-of-two so peeling a single
+                                 scalar iteration is sufficient.  */
+                              unsigned HOST_WIDE_INT cremain;
+                              if (remain.is_constant (&cremain))
+                                {
+                                  unsigned HOST_WIDE_INT cpart_size
+                                    = 1 << ceil_log2 (cremain);
+                                  if (known_gt (nunits, cpart_size)
+                                      && constant_multiple_p (nunits, cpart_size,
+                                                              &num))
+                                    {
+                                      tree ptype;
+                                      new_vtype
+                                        = vector_vector_composition_type (vectype,
+                                                                          num,
+                                                                          &ptype);
+                                      if (new_vtype)
+                                        ltype = ptype;
+                                    }
+                                }
                             }
                         }
                       tree offset