From patchwork Fri Aug 30 13:39:31 2024
X-Patchwork-Submitter: Richard Biener
X-Patchwork-Id: 1979020
Date: Fri, 30 Aug 2024 15:39:31 +0200 (CEST)
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 2/3] RISC-V: load and store-lanes with SLP
Message-Id: <20240830133939.795C513A44@imap1.dmz-prg2.suse.org>

The following is a prototype for how to represent load/store-lanes
within SLP.  For now I've settled on having a single load node with
multiple permute nodes acting as selection, one for each loaded lane,
and a single store node fed from all stored lanes.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you get the following SLP graph, where I explain how things are set up
and code-generated:

t.c:23:21: note: SLP graph after lowering permutations:
t.c:23:21: note: node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note: op template: *_6 = _7;
t.c:23:21: note:   stmt 0 *_6 = _7;
t.c:23:21: note:   stmt 1 *_12 = _13;
t.c:23:21: note:   children 0x50dc488 0x50dc6e8

This is the store node; it is marked with ldst_lanes = true during SLP
discovery.  This node code-generates

  vect_array.65[0] = vect__7.61_29;
  vect_array.65[1] = vect__13.62_28;
  MEM [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...

t.c:23:21: note: node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note: op: VEC_PERM_EXPR
t.c:23:21: note:   stmt 0 _5 = *_4;
t.c:23:21: note:   lane permutation { 0[0] }
t.c:23:21: note:   children 0x50dc948
t.c:23:21: note: node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note: op: VEC_PERM_EXPR
t.c:23:21: note:   stmt 0 _11 = *_10;
t.c:23:21: note:   lane permutation { 0[1] }
t.c:23:21: note:   children 0x50dc948

These are the selection nodes, also marked with ldst_lanes = true.
They generate no code.

t.c:23:21: note: node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note: op template: _5 = *_4;
t.c:23:21: note:   stmt 0 _5 = *_4;
t.c:23:21: note:   stmt 1 _11 = *_10;
t.c:23:21: note:   load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking the lane permutes in the
selection nodes into account).  It code-generates

  vect_array.58 = .LOAD_LANES (MEM [(int *)vectp_b.56_33]);
  vect__5.59_31 = vect_array.58[0];
  vect__5.60_30 = vect_array.58[1];

This scheme allows code generation in vectorizable_load/store to stay
mostly as-is.

While this should support both load-lanes and (masked) store-lanes,
the decision to use either is made at SLP discovery time and cannot be
reversed without altering the SLP tree: as-is the SLP tree is not
usable for non-store-lanes on the store side.  The load side is OK
representation-wise, but it will very likely fail permute handling
because the lowering that deals with the two-input-vector restriction
isn't done - and since the permute node is marked as to be ignored,
that doesn't work out anyway.  So I've put restrictions in place that
fail vectorization if a load/store-lanes SLP tree is later classified
differently by get_load_store_type.

I'll note that, for example, gcc.target/aarch64/sve/mask_struct_store_3.c
will not get SLP store-lanes used, because the full store SLPs just
fine; we then fail to handle the "splat" load permutation

t2.c:5:21: note: node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
t2.c:5:21: note: op template: _6 = *_5;
t2.c:5:21: note:   stmt 0 _6 = *_5;
t2.c:5:21: note:   stmt 1 _6 = *_5;
t2.c:5:21: note:   stmt 2 _6 = *_5;
t2.c:5:21: note:   stmt 3 _6 = *_5;
t2.c:5:21: note:   load permutation { 0 0 0 0 }

since the load-permute lowering code currently doesn't consider it
worth lowering single loads from a group (or, as in this case,
non-grouped loads).  The expectation is that the target can handle
this with two interleaves of the vector with itself.
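To illustrate the kind of loop that produces such a splat load
permutation, here is a rough sketch in the spirit of the
mask_struct_store testcases; this is not the actual testcase source,
and the function name and signature are made up for illustration:

  /* Hypothetical sketch, not the real mask_struct_store_3.c: the same
     loaded value feeds every lane of the four-element store group, so
     the single remaining load gets the { 0 0 0 0 } "splat" permutation.  */
  void
  splat_store (int *__restrict dest, int *__restrict src,
	       int *__restrict cond, int n)
  {
    for (int i = 0; i < n; ++i)
      {
	int value = src[i];
	if (cond[i])
	  {
	    dest[i * 4] = value;
	    dest[i * 4 + 1] = value;
	    dest[i * 4 + 2] = value;
	    dest[i * 4 + 3] = value;
	  }
      }
  }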
So what we see here is that while the explicit SLP representation is
helpful in some cases, in cases like this it would require changing
the representation when we make decisions about how to vectorize.  My
expectation is that this will all change considerably when we re-do
SLP discovery (for loops) and when we get rid of non-SLP, as I think
vectorizable_* should be allowed to alter the SLP graph during
analysis.

Testing on RISC-V and aarch64 reveals several testcases that require
adjustment so as to now expect SLP even when load/store-lanes are
being used.  When in doubt I've adjusted them to the final
expectation, which will lead to one or two new FAILs where we still do
the SLP cancelling.  I have a followup in final testing that
implements this while remaining in SLP.  I have not bothered to adjust
target tests that now fail their assembly scans.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
	(vect_build_slp_instance): For stores, iff the target prefers
	store-lanes, discover single-lane sub-groups; do not perform
	interleaving lowering but mark the node with ldst_lanes.
	Also allow i == 0 - fatal failure - for splitting up a store group
	when we're not doing single-lane discovery already.
	(vect_lower_load_permutations): When the target supports load-lanes
	and the loads all fit the pattern, split out a single level of
	permutes only and mark the load and permute nodes with ldst_lanes.
	(vectorizable_slp_permutation_1): Handle the load-lane permute
	forwarding of vector defs.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Support SLP code generation for store-lanes.
	(vectorizable_load): Support SLP code generation for load-lanes.

	* gcc.dg/vect/slp-55.c: New testcase.
	* gcc.dg/vect/slp-56.c: Likewise.
	* gcc.dg/vect/slp-11c.c: Adjust.
	* gcc.dg/vect/slp-53.c: Likewise.
	* gcc.dg/vect/slp-cond-1.c: Likewise.
	* gcc.dg/vect/vect-complex-5.c: Likewise.
	* gcc.dg/vect/slp-1.c: Likewise.
	* gcc.dg/vect/slp-54.c: Remove riscv XFAIL.
	* gcc.dg/vect/slp-perm-5.c: Adjust.
	* gcc.dg/vect/slp-perm-7.c: Likewise.
	* gcc.dg/vect/slp-perm-8.c: Likewise.
	* gcc.dg/vect/slp-perm-9.c: Likewise.
	* gcc.dg/vect/slp-multitypes-11.c: Likewise.
	* gcc.dg/vect/slp-multitypes-11-big-array.c: Likewise.
--- gcc/testsuite/gcc.dg/vect/slp-1.c | 3 +- gcc/testsuite/gcc.dg/vect/slp-11c.c | 3 +- gcc/testsuite/gcc.dg/vect/slp-53.c | 3 +- gcc/testsuite/gcc.dg/vect/slp-54.c | 2 +- gcc/testsuite/gcc.dg/vect/slp-55.c | 37 +++++ gcc/testsuite/gcc.dg/vect/slp-56.c | 51 +++++++ gcc/testsuite/gcc.dg/vect/slp-cond-1.c | 3 +- .../gcc.dg/vect/slp-multitypes-11-big-array.c | 3 +- gcc/testsuite/gcc.dg/vect/slp-multitypes-11.c | 4 +- gcc/testsuite/gcc.dg/vect/slp-perm-5.c | 5 +- gcc/testsuite/gcc.dg/vect/slp-perm-7.c | 4 +- gcc/testsuite/gcc.dg/vect/slp-perm-8.c | 6 +- gcc/testsuite/gcc.dg/vect/slp-perm-9.c | 4 +- gcc/testsuite/gcc.dg/vect/vect-complex-5.c | 3 +- gcc/tree-vect-slp.cc | 142 ++++++++++++++++-- gcc/tree-vect-stmts.cc | 127 ++++++++++++---- gcc/tree-vectorizer.h | 4 + 17 files changed, 337 insertions(+), 67 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/slp-55.c create mode 100644 gcc/testsuite/gcc.dg/vect/slp-56.c diff --git a/gcc/testsuite/gcc.dg/vect/slp-1.c b/gcc/testsuite/gcc.dg/vect/slp-1.c index d4a13f12df6..e1a45e1f1a7 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-1.c +++ b/gcc/testsuite/gcc.dg/vect/slp-1.c @@ -122,5 +122,4 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 4 loops" 1 "vect" } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target {! vect_strided5 } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_strided5 } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-11c.c b/gcc/testsuite/gcc.dg/vect/slp-11c.c index 2e70fca39ba..25d7f2ce383 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-11c.c +++ b/gcc/testsuite/gcc.dg/vect/slp-11c.c @@ -45,5 +45,4 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } */ /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { { vect_uintfloat_cvt && vect_strided2 } && vect_int_mult } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { vect_load_lanes } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! 
vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target { vect_load_lanes } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-53.c b/gcc/testsuite/gcc.dg/vect/slp-53.c index d8cd5f85b3c..50b3e9d3cee 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-53.c +++ b/gcc/testsuite/gcc.dg/vect/slp-53.c @@ -12,4 +12,5 @@ void foo (int * __restrict x, int *y) } } -/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail vect_load_lanes } } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */ +/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target { vect_load_lanes } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-54.c b/gcc/testsuite/gcc.dg/vect/slp-54.c index ab66b349d1f..57268ab50b7 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-54.c +++ b/gcc/testsuite/gcc.dg/vect/slp-54.c @@ -15,4 +15,4 @@ void foo (int * __restrict x, int *y) } } -/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } xfail riscv*-*-* } } } */ +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-55.c b/gcc/testsuite/gcc.dg/vect/slp-55.c new file mode 100644 index 00000000000..0bf65ef6dc4 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/slp-55.c @@ -0,0 +1,37 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target vect_int_mult } */ +/* { dg-additional-options "-fdump-tree-optimized" } */ + +void foo (int * __restrict a, int *b, int *c) +{ + for (int i = 0; i < 1024; ++i) + { + a[2*i] = b[i] + 7; + a[2*i+1] = c[i] * 3; + } +} + +int bar (int *b) +{ + int res = 0; + for (int i = 0; i < 1024; ++i) + { + res += b[2*i] + 7; + res += b[2*i+1] * 3; + } + return res; +} + +void baz (int * __restrict a, int *b) +{ + for (int i = 0; i < 1024; ++i) + { + a[2*i] = b[2*i] + 7; + a[2*i+1] = b[2*i+1] * 3; + } +} + +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */ +/* { dg-final { scan-tree-dump-times "LOAD_LANES" 2 "optimized" { target vect_load_lanes } } } */ +/* { dg-final { scan-tree-dump-times "STORE_LANES" 2 "optimized" { target vect_load_lanes } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-56.c b/gcc/testsuite/gcc.dg/vect/slp-56.c new file mode 100644 index 00000000000..0b985eae55e --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/slp-56.c @@ -0,0 +1,51 @@ +#include "tree-vect.h" + +/* This is a load-lane / masked-store-lane test that more reliably + triggers SLP than SVEs mask_srtuct_store_*.c */ + +void __attribute__ ((noipa)) +test4 (int *__restrict dest, int *__restrict src, + int *__restrict cond, int bias, int n) +{ + for (int i = 0; i < n; ++i) + { + int value0 = src[i * 4] + bias; + int value1 = src[i * 4 + 1] * bias; + int value2 = src[i * 4 + 2] + bias; + int value3 = src[i * 4 + 3] * bias; + if (cond[i]) + { + dest[i * 4] = value0; + dest[i * 4 + 1] = value1; + dest[i * 4 + 2] = value2; + dest[i * 4 + 3] = value3; + } + } +} + +int dest[16*4]; +int src[16*4]; +int cond[16]; +const int dest_chk[16*4] = {0, 0, 0, 0, 9, 25, 11, 35, 0, 0, 0, 0, 17, 65, 19, + 75, 0, 0, 0, 0, 25, 105, 27, 115, 0, 0, 0, 0, 33, 145, 35, 155, 0, 0, 0, + 0, 41, 185, 43, 195, 0, 0, 0, 0, 49, 225, 51, 235, 0, 0, 0, 0, 57, 265, 59, + 275, 0, 0, 0, 0, 65, 305, 67, 315}; + +int main() +{ + check_vect (); +#pragma GCC novector + for (int i = 0; i < 16; ++i) + 
cond[i] = i & 1; +#pragma GCC novector + for (int i = 0; i < 16 * 4; ++i) + src[i] = i; + test4 (dest, src, cond, 5, 16); +#pragma GCC novector + for (int i = 0; i < 16 * 4; ++i) + if (dest[i] != dest_chk[i]) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target { vect_variable_length && vect_load_lanes } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c index c76ea5d17ef..16ab0cc7605 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-cond-1.c +++ b/gcc/testsuite/gcc.dg/vect/slp-cond-1.c @@ -125,5 +125,4 @@ main () return 0; } -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" { target { ! vect_load_lanes } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target { vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 4 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-multitypes-11-big-array.c b/gcc/testsuite/gcc.dg/vect/slp-multitypes-11-big-array.c index 2792b932734..07f871c8972 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-multitypes-11-big-array.c +++ b/gcc/testsuite/gcc.dg/vect/slp-multitypes-11-big-array.c @@ -56,5 +56,4 @@ int main (void) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_unpack } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_unpack xfail { vect_variable_length && vect_load_lanes } } } } */ - +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_unpack } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-multitypes-11.c b/gcc/testsuite/gcc.dg/vect/slp-multitypes-11.c index 5c75dc12b69..0f7b479ce59 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-multitypes-11.c +++ b/gcc/testsuite/gcc.dg/vect/slp-multitypes-11.c @@ -51,5 +51,5 @@ int main (void) /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_unpack } } } */ /* The epilogues are vectorized using partial vectors. */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_unpack && { { ! vect_partial_vectors_usage_1 } || s390_vx } } xfail { vect_variable_length && vect_load_lanes } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { { vect_unpack && vect_partial_vectors_usage_1 } && { ! s390_vx } } xfail { vect_variable_length && vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_unpack && { { ! vect_partial_vectors_usage_1 } || s390_vx } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { { vect_unpack && vect_partial_vectors_usage_1 } && { ! s390_vx } } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-5.c b/gcc/testsuite/gcc.dg/vect/slp-perm-5.c index 7128cf47155..0dedd4a9b86 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-perm-5.c +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-5.c @@ -105,9 +105,6 @@ int main (int argc, const char* argv[]) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_perm } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_perm3_int && { ! 
vect_load_lanes } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */ -/* { dg-final { scan-tree-dump "Built SLP cancelled: can use load/store-lanes" "vect" { target { vect_perm3_int && vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_perm3_int || vect_load_lanes } } } } */ /* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */ /* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */ - diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c index df13c37bc75..f15736ef729 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c @@ -97,8 +97,6 @@ int main (int argc, const char* argv[]) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_perm } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_int && { ! vect_load_lanes } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */ -/* { dg-final { scan-tree-dump "Built SLP cancelled: can use load/store-lanes" "vect" { target { vect_perm3_int && vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_int || vect_load_lanes } } } } */ /* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */ /* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-8.c b/gcc/testsuite/gcc.dg/vect/slp-perm-8.c index 029be5485b6..7610524f0bf 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-perm-8.c +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-8.c @@ -61,10 +61,8 @@ int main (int argc, const char* argv[]) } /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_perm_byte } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_byte && { { ! vect_load_lanes } && { { ! vect_partial_vectors_usage_1 } || s390_vx } } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_byte && { { ! vect_partial_vectors_usage_1 } || s390_vx } } } } } */ /* The epilogues are vectorized using partial vectors. */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_perm3_byte && { { ! vect_load_lanes } && { vect_partial_vectors_usage_1 && { ! s390_vx } } } } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */ -/* { dg-final { scan-tree-dump "Built SLP cancelled: can use load/store-lanes" "vect" { target { vect_perm3_byte && vect_load_lanes } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { vect_perm3_byte && { vect_partial_vectors_usage_1 && { ! s390_vx } } } } } } */ /* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */ /* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-9.c b/gcc/testsuite/gcc.dg/vect/slp-perm-9.c index 89400fb4565..98f1d022226 100644 --- a/gcc/testsuite/gcc.dg/vect/slp-perm-9.c +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-9.c @@ -60,5 +60,5 @@ int main (int argc, const char* argv[]) vectors. 
*/ /* { dg-final { scan-tree-dump "permutation requires at least three vectors" "vect" { target { vect_perm_short && { ! vect_perm3_short } } xfail vect_variable_length } } } */ /* { dg-final { scan-tree-dump-not "permutation requires at least three vectors" "vect" { target vect_perm3_short } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { ! { vect_perm3_short || vect32 } } || vect_load_lanes } } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_perm3_short || vect32 } && { ! vect_load_lanes } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_perm3_short || { vect32 || vect_load_lanes } } } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm3_short || { vect32 || vect_load_lanes } } } } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c index ac562dc475c..0d850720d63 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-complex-5.c +++ b/gcc/testsuite/gcc.dg/vect/vect-complex-5.c @@ -40,5 +40,4 @@ main (void) return 0; } -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */ -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { ! vect_load_lanes } xfail { ! vect_hw_misalign } } } } */ +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_hw_misalign } } } } */ diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 2304cdac583..fe13f136552 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -121,6 +121,7 @@ _slp_tree::_slp_tree () SLP_TREE_SIMD_CLONE_INFO (this) = vNULL; SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def; SLP_TREE_CODE (this) = ERROR_MARK; + this->ldst_lanes = false; SLP_TREE_VECTYPE (this) = NULL_TREE; SLP_TREE_REPRESENTATIVE (this) = NULL; SLP_TREE_REF_COUNT (this) = 1; @@ -3905,10 +3906,33 @@ vect_build_slp_instance (vec_info *vinfo, /* For loop vectorization split the RHS into arbitrary pieces of size >= 1. */ else if (is_a (vinfo) - && (i > 0 && i < group_size) - && !vect_slp_prefer_store_lanes_p (vinfo, - stmt_info, group_size, i)) - { + && (group_size != 1 && i < group_size)) + { + /* There are targets that cannot do even/odd interleaving schemes + so they absolutely need to use load/store-lanes. For now + force single-lane SLP for them - they would be happy with + uniform power-of-two lanes (but depending on element size), + but even if we can use 'i' as indicator we would need to + backtrack when later lanes fail to discover with the same + granularity. We cannot turn any of strided or scatter store + into store-lanes. */ + /* ??? If this is not in sync with what get_load_store_type + later decides the SLP representation is not good for other + store vectorization methods. */ + bool want_store_lanes + = (! STMT_VINFO_GATHER_SCATTER_P (stmt_info) + && ! STMT_VINFO_STRIDED_P (stmt_info) + && compare_step_with_zero (vinfo, stmt_info) > 0 + && vect_slp_prefer_store_lanes_p (vinfo, stmt_info, + group_size, 1)); + if (want_store_lanes) + i = 1; + + /* A fatal discovery fail doesn't always mean single-lane SLP + isn't a possibility, so try. 
*/ + if (i == 0) + i = 1; + if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "Splitting SLP group at stmt %u\n", i); @@ -3942,7 +3966,10 @@ vect_build_slp_instance (vec_info *vinfo, (max_nunits, end - start)); rhs_nodes.safe_push (node); start = end; - end = group_size; + if (want_store_lanes) + end = start + 1; + else + end = group_size; } else { @@ -3976,7 +4003,31 @@ vect_build_slp_instance (vec_info *vinfo, } /* Now we assume we can build the root SLP node from all stores. */ - node = vect_build_slp_store_interleaving (rhs_nodes, scalar_stmts); + if (want_store_lanes) + { + /* For store-lanes feed the store node with all RHS nodes + in order. */ + node = vect_create_new_slp_node (scalar_stmts, + SLP_TREE_CHILDREN + (rhs_nodes[0]).length ()); + SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (rhs_nodes[0]); + node->ldst_lanes = true; + SLP_TREE_CHILDREN (node) + .reserve_exact (SLP_TREE_CHILDREN (rhs_nodes[0]).length () + + rhs_nodes.length () - 1); + /* First store value and possibly mask. */ + SLP_TREE_CHILDREN (node) + .splice (SLP_TREE_CHILDREN (rhs_nodes[0])); + /* Rest of the store values. All mask nodes are the same, + this should be guaranteed by dataref group discovery. */ + for (unsigned j = 1; j < rhs_nodes.length (); ++j) + SLP_TREE_CHILDREN (node) + .quick_push (SLP_TREE_CHILDREN (rhs_nodes[j])[0]); + for (slp_tree child : SLP_TREE_CHILDREN (node)) + child->refcnt++; + } + else + node = vect_build_slp_store_interleaving (rhs_nodes, scalar_stmts); while (!rhs_nodes.is_empty ()) vect_free_slp_tree (rhs_nodes.pop ()); @@ -4184,12 +4235,50 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo, lower. */ stmt_vec_info first = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]); + unsigned group_lanes = DR_GROUP_SIZE (first); + + /* Verify if all load permutations can be implemented with a suitably + large element load-lanes operation. */ + unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]); + if (STMT_VINFO_STRIDED_P (first) + || compare_step_with_zero (loop_vinfo, first) <= 0 + || exact_log2 (ld_lanes_lanes) == -1 + /* ??? For now only support the single-lane case as there is + missing support on the store-lane side and code generation + isn't up to the task yet. */ + || ld_lanes_lanes != 1 + || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]), + group_lanes / ld_lanes_lanes, + false) == IFN_LAST) + ld_lanes_lanes = 0; + else + /* Verify the loads access the same number of lanes aligned to + ld_lanes_lanes. */ + for (slp_tree load : loads) + { + if (SLP_TREE_LANES (load) != ld_lanes_lanes) + { + ld_lanes_lanes = 0; + break; + } + unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0]; + if (first % ld_lanes_lanes != 0) + { + ld_lanes_lanes = 0; + break; + } + for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i) + if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i) + { + ld_lanes_lanes = 0; + break; + } + } /* Only a power-of-two number of lanes matches interleaving with N levels. ??? An even number of lanes could be reduced to 1<= (group_lanes + 1) / 2) + if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2 + && ld_lanes_lanes == 0) continue; /* First build (and possibly re-use) a load node for the @@ -4239,10 +4329,20 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo, final_perm.quick_push (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i])); + if (ld_lanes_lanes != 0) + { + /* ??? If this is not in sync with what get_load_store_type + later decides the SLP representation is not good for other + store vectorization methods. 
*/ + l0->ldst_lanes = true; + load->ldst_lanes = true; + } + while (1) { unsigned group_lanes = SLP_TREE_LANES (l0); - if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2) + if (ld_lanes_lanes != 0 + || SLP_TREE_LANES (load) >= (group_lanes + 1) / 2) break; /* Try to lower by reducing the group to half its size using an @@ -9877,6 +9977,28 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi, gcc_assert (perm.length () == SLP_TREE_LANES (node)); + /* Load-lanes permute. This permute only acts as a forwarder to + select the correct vector def of the load-lanes load which + has the permuted vectors in its vector defs like + { v0, w0, r0, v1, w1, r1 ... } for a ld3. */ + if (node->ldst_lanes) + { + gcc_assert (children.length () == 1); + if (!gsi) + /* This is a trivial op always supported. */ + return 1; + slp_tree child = children[0]; + unsigned vec_idx = (SLP_TREE_LANE_PERMUTATION (node)[0].second + / SLP_TREE_LANES (node)); + unsigned vec_num = SLP_TREE_LANES (child) / SLP_TREE_LANES (node); + for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i) + { + tree def = SLP_TREE_VEC_DEFS (child)[i * vec_num + vec_idx]; + node->push_vec_def (def); + } + return 1; + } + /* REPEATING_P is true if every output vector is guaranteed to use the same permute vector. We can handle that case for both variable-length and constant-length vectors, but we only handle other cases for diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 72a29c0584b..d2282c0dc4f 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -1509,7 +1509,8 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype, unsigned int nvectors; if (slp_node) - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + /* ??? Incorrect for multi-lane lanes. */ + nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size; else nvectors = vect_get_num_copies (loop_vinfo, vectype); @@ -1795,7 +1796,7 @@ vect_use_strided_gather_scatters_p (stmt_vec_info stmt_info, elements with a known constant step. Return -1 if that step is negative, 0 if it is zero, and 1 if it is greater than zero. */ -static int +int compare_step_with_zero (vec_info *vinfo, stmt_vec_info stmt_info) { dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info); @@ -2070,6 +2071,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, is irrelevant for them. */ *alignment_support_scheme = dr_unaligned_supported; } + /* Try using LOAD/STORE_LANES. */ + else if (slp_node->ldst_lanes + && (*lanes_ifn + = (vls_type == VLS_LOAD + ? 
vect_load_lanes_supported (vectype, group_size, masked_p) + : vect_store_lanes_supported (vectype, group_size, + masked_p))) != IFN_LAST) + *memory_access_type = VMAT_LOAD_STORE_LANES; else *memory_access_type = VMAT_CONTIGUOUS; @@ -8201,6 +8210,16 @@ vectorizable_store (vec_info *vinfo, &lanes_ifn)) return false; + if (slp_node + && slp_node->ldst_lanes + && memory_access_type != VMAT_LOAD_STORE_LANES) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "discovered store-lane but cannot use it.\n"); + return false; + } + if (mask) { if (memory_access_type == VMAT_CONTIGUOUS) @@ -8717,7 +8736,7 @@ vectorizable_store (vec_info *vinfo, else { if (memory_access_type == VMAT_LOAD_STORE_LANES) - aggr_type = build_array_type_nelts (elem_type, vec_num * nunits); + aggr_type = build_array_type_nelts (elem_type, group_size * nunits); else aggr_type = vectype; bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type, @@ -8774,11 +8793,24 @@ vectorizable_store (vec_info *vinfo, if (memory_access_type == VMAT_LOAD_STORE_LANES) { - gcc_assert (!slp && grouped_store); + if (costing_p && slp_node) + /* Update all incoming store operand nodes, the general handling + above only handles the mask and the first store operand node. */ + for (slp_tree child : SLP_TREE_CHILDREN (slp_node)) + if (child != mask_node + && !vect_maybe_update_slp_op_vectype (child, vectype)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "incompatible vector types for invariants\n"); + return false; + } unsigned inside_cost = 0, prologue_cost = 0; /* For costing some adjacent vector stores, we'd like to cost with the total number of them once instead of cost each one by one. */ unsigned int n_adjacent_stores = 0; + if (slp) + ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size; for (j = 0; j < ncopies; j++) { gimple *new_stmt; @@ -8796,7 +8828,7 @@ vectorizable_store (vec_info *vinfo, op = vect_get_store_rhs (next_stmt_info); if (costing_p) update_prologue_cost (&prologue_cost, op); - else + else if (!slp) { vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies, op, @@ -8811,15 +8843,15 @@ vectorizable_store (vec_info *vinfo, { if (mask) { - vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies, - mask, &vec_masks, - mask_vectype); + if (slp_node) + vect_get_slp_defs (mask_node, &vec_masks); + else + vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies, + mask, &vec_masks, + mask_vectype); vec_mask = vec_masks[0]; } - /* We should have catched mismatched types earlier. */ - gcc_assert ( - useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd))); dataref_ptr = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type, NULL, offset, &dummy, @@ -8831,10 +8863,16 @@ vectorizable_store (vec_info *vinfo, gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)); /* DR_CHAIN is then used as an input to vect_permute_store_chain(). */ - for (i = 0; i < group_size; i++) + if (!slp) { - vec_oprnd = (*gvec_oprnds[i])[j]; - dr_chain[i] = vec_oprnd; + /* We should have caught mismatched types earlier. 
*/ + gcc_assert ( + useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd))); + for (i = 0; i < group_size; i++) + { + vec_oprnd = (*gvec_oprnds[i])[j]; + dr_chain[i] = vec_oprnd; + } } if (mask) vec_mask = vec_masks[j]; @@ -8844,12 +8882,12 @@ vectorizable_store (vec_info *vinfo, if (costing_p) { - n_adjacent_stores += vec_num; + n_adjacent_stores += group_size; continue; } /* Get an array into which we can store the individual vectors. */ - tree vec_array = create_vector_array (vectype, vec_num); + tree vec_array = create_vector_array (vectype, group_size); /* Invalidate the current contents of VEC_ARRAY. This should become an RTL clobber too, which prevents the vector registers @@ -8857,9 +8895,19 @@ vectorizable_store (vec_info *vinfo, vect_clobber_variable (vinfo, stmt_info, gsi, vec_array); /* Store the individual vectors into the array. */ - for (i = 0; i < vec_num; i++) + for (i = 0; i < group_size; i++) { - vec_oprnd = dr_chain[i]; + if (slp) + { + slp_tree child; + if (i == 0 || !mask_node) + child = SLP_TREE_CHILDREN (slp_node)[i]; + else + child = SLP_TREE_CHILDREN (slp_node)[i + 1]; + vec_oprnd = SLP_TREE_VEC_DEFS (child)[j]; + } + else + vec_oprnd = dr_chain[i]; write_vector_array (vinfo, stmt_info, gsi, vec_oprnd, vec_array, i); } @@ -8929,9 +8977,10 @@ vectorizable_store (vec_info *vinfo, /* Record that VEC_ARRAY is now dead. */ vect_clobber_variable (vinfo, stmt_info, gsi, vec_array); - if (j == 0) + if (j == 0 && !slp) *vec_stmt = new_stmt; - STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt); + if (!slp) + STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt); } if (costing_p) @@ -10035,6 +10084,16 @@ vectorizable_load (vec_info *vinfo, &lanes_ifn)) return false; + if (slp_node + && slp_node->ldst_lanes + && memory_access_type != VMAT_LOAD_STORE_LANES) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "discovered load-lane but cannot use it.\n"); + return false; + } + if (mask) { if (memory_access_type == VMAT_CONTIGUOUS) @@ -10753,7 +10812,7 @@ vectorizable_load (vec_info *vinfo, else { if (memory_access_type == VMAT_LOAD_STORE_LANES) - aggr_type = build_array_type_nelts (elem_type, vec_num * nunits); + aggr_type = build_array_type_nelts (elem_type, group_size * nunits); else aggr_type = vectype; bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type, @@ -10777,12 +10836,13 @@ vectorizable_load (vec_info *vinfo, { gcc_assert (alignment_support_scheme == dr_aligned || alignment_support_scheme == dr_unaligned_supported); - gcc_assert (grouped_load && !slp); unsigned int inside_cost = 0, prologue_cost = 0; /* For costing some adjacent vector loads, we'd like to cost with the total number of them once instead of cost each one by one. */ unsigned int n_adjacent_loads = 0; + if (slp_node) + ncopies = slp_node->vec_stmts_size / group_size; for (j = 0; j < ncopies; j++) { if (costing_p) @@ -10833,7 +10893,7 @@ vectorizable_load (vec_info *vinfo, if (mask) vec_mask = vec_masks[j]; - tree vec_array = create_vector_array (vectype, vec_num); + tree vec_array = create_vector_array (vectype, group_size); tree final_mask = NULL_TREE; tree final_len = NULL_TREE; @@ -10896,24 +10956,31 @@ vectorizable_load (vec_info *vinfo, gimple_call_set_nothrow (call, true); vect_finish_stmt_generation (vinfo, stmt_info, call, gsi); - dr_chain.create (vec_num); + if (!slp) + dr_chain.create (group_size); /* Extract each vector into an SSA_NAME. 
*/ - for (i = 0; i < vec_num; i++) + for (unsigned i = 0; i < group_size; i++) { new_temp = read_vector_array (vinfo, stmt_info, gsi, scalar_dest, vec_array, i); - dr_chain.quick_push (new_temp); + if (slp) + slp_node->push_vec_def (new_temp); + else + dr_chain.quick_push (new_temp); } - /* Record the mapping between SSA_NAMEs and statements. */ - vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain); + if (!slp) + /* Record the mapping between SSA_NAMEs and statements. */ + vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain); /* Record that VEC_ARRAY is now dead. */ vect_clobber_variable (vinfo, stmt_info, gsi, vec_array); - dr_chain.release (); + if (!slp) + dr_chain.release (); - *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0]; + if (!slp_node) + *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0]; } if (costing_p) diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index df6c8ada2f7..699ae9e33ba 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -222,6 +222,9 @@ struct _slp_tree { unsigned int lanes; /* The operation of this node. */ enum tree_code code; + /* Whether uses of this load or feeders of this store are suitable + for load/store-lanes. */ + bool ldst_lanes; int vertex; @@ -2313,6 +2316,7 @@ extern bool supportable_indirect_convert_operation (code_helper, tree, tree, vec > *, tree = NULL_TREE); +extern int compare_step_with_zero (vec_info *, stmt_vec_info); extern unsigned record_stmt_cost (stmt_vector_for_cost *, int, enum vect_cost_for_stmt, stmt_vec_info,