From patchwork Fri Oct 18 13:12:59 2024
X-Patchwork-Submitter: Craig Blackmore
X-Patchwork-Id: 1999113
From: Craig Blackmore
To: gcc-patches@gcc.gnu.org
Cc: Craig Blackmore
Subject: [PATCH 6/7] RISC-V: Make vectorized memset handle more cases
Date: Fri, 18 Oct 2024 14:12:59 +0100
Message-ID: <20241018131300.1150819-7-craig.blackmore@embecosm.com>
In-Reply-To: <20241018131300.1150819-1-craig.blackmore@embecosm.com>
References: <20241018131300.1150819-1-craig.blackmore@embecosm.com>

`expand_vec_setmem` only generated a vectorized memset if it fitted into a
single vector store.  Extend it to generate a loop for longer and unknown
lengths.

The test cases now use -O1 so that they are not sensitive to scheduling.

gcc/ChangeLog:

	* config/riscv/riscv-string.cc (use_vector_stringop_p): Add
	comment.
	(expand_vec_setmem): Use use_vector_stringop_p instead of
	check_vectorise_memory_operation.  Add loop generation.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/setmem-1.c: Use -O1.  Expect a loop
	instead of a libcall.  Add test for unknown length.
	* gcc.target/riscv/rvv/base/setmem-2.c: Likewise.
	* gcc.target/riscv/rvv/base/setmem-3.c: Likewise and expect
	smaller lmul.
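For illustration only (not part of the patch): a minimal user-level sketch of
the two kinds of calls that now expand to an inline broadcast plus
predicated-store loop instead of a memset libcall, mirroring the new test
cases below.  It assumes a 128-bit minimum vector length, so 129 stands in
for MIN_VECTOR_BYTES * 8 + 1; the function names are invented for the
example.

#include <stddef.h>

/* Length known at compile time but too large for a single vector store:
   previously this became a tail call to memset, now it expands to an
   inline vsetvli/vmv.v.x/vse8.v loop.  */
void *
fill_large (void *buf, int c)
{
  return __builtin_memset (buf, c, 129);
}

/* Length only known at run time: also expanded to an inline loop now.  */
void *
fill_n (void *buf, int c, size_t n)
{
  return __builtin_memset (buf, c, n);
}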
---
 gcc/config/riscv/riscv-string.cc              | 83 ++++++++++++++-----
 .../gcc.target/riscv/rvv/base/setmem-1.c      | 37 ++++++++-
 .../gcc.target/riscv/rvv/base/setmem-2.c      | 37 ++++++++-
 .../gcc.target/riscv/rvv/base/setmem-3.c      | 41 +++++++--
 4 files changed, 160 insertions(+), 38 deletions(-)

diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index 118c02a4021..91b0ec03118 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -1062,6 +1062,9 @@ struct stringop_info {
 
    MAX_EW is the maximum element width that the caller wants to use and
    LENGTH_IN is the length of the stringop in bytes.
+
+   This is currently used for cpymem and setmem.  If expand_vec_cmpmem switches
+   to using it too then check_vectorise_memory_operation can be removed.
 */
 
 static bool
@@ -1600,41 +1603,75 @@ check_vectorise_memory_operation (rtx length_in, HOST_WIDE_INT &lmul_out)
 bool
 expand_vec_setmem (rtx dst_in, rtx length_in, rtx fill_value_in)
 {
-  HOST_WIDE_INT lmul;
+  stringop_info info;
+
   /* Check we are able and allowed to vectorise this operation;
      bail if not.  */
-  if (!check_vectorise_memory_operation (length_in, lmul))
+  if (!use_vector_stringop_p (info, 1, length_in))
     return false;
 
-  machine_mode vmode
-    = riscv_vector::get_vector_mode (QImode, BYTES_PER_RISCV_VECTOR * lmul)
-	.require ();
+  /* avl holds the (remaining) length of the required set.
+     cnt holds the length we set with the current store.  */
+  rtx cnt = info.avl;
   rtx dst_addr = copy_addr_to_reg (XEXP (dst_in, 0));
-  rtx dst = change_address (dst_in, vmode, dst_addr);
+  rtx dst = change_address (dst_in, info.vmode, dst_addr);
 
-  rtx fill_value = gen_reg_rtx (vmode);
+  rtx fill_value = gen_reg_rtx (info.vmode);
   rtx broadcast_ops[] = { fill_value, fill_value_in };
 
-  /* If the length is exactly vlmax for the selected mode, do that.
-     Otherwise, use a predicated store.  */
-  if (known_eq (GET_MODE_SIZE (vmode), INTVAL (length_in)))
+  rtx label = NULL_RTX;
+  rtx mask = NULL_RTX;
+
+  /* If we don't need a loop and the length is exactly vlmax for the selected
+     mode do a broadcast and store, otherwise use a predicated store.  */
+  if (!info.need_loop
+      && known_eq (GET_MODE_SIZE (info.vmode), INTVAL (length_in)))
     {
-      emit_vlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
-		       broadcast_ops);
+      emit_vlmax_insn (code_for_pred_broadcast (info.vmode), UNARY_OP,
+		       broadcast_ops);
       emit_move_insn (dst, fill_value);
+      return true;
     }
-  else
+
+  machine_mode mask_mode
+    = riscv_vector::get_vector_mode (BImode,
+				     GET_MODE_NUNITS (info.vmode)).require ();
+  mask = CONSTM1_RTX (mask_mode);
+  if (!satisfies_constraint_K (cnt))
+    cnt = force_reg (Pmode, cnt);
+
+  if (info.need_loop)
     {
-      if (!satisfies_constraint_K (length_in))
-	length_in = force_reg (Pmode, length_in);
-      emit_nonvlmax_insn (code_for_pred_broadcast (vmode), UNARY_OP,
-			  broadcast_ops, length_in);
-      machine_mode mask_mode
-	= riscv_vector::get_vector_mode (BImode, GET_MODE_NUNITS (vmode))
-	    .require ();
-      rtx mask = CONSTM1_RTX (mask_mode);
-      emit_insn (gen_pred_store (vmode, dst, mask, fill_value, length_in,
-				 get_avl_type_rtx (riscv_vector::NONVLMAX)));
+      info.avl = copy_to_mode_reg (Pmode, info.avl);
+      cnt = gen_reg_rtx (Pmode);
+      emit_insn (riscv_vector::gen_no_side_effects_vsetvl_rtx (info.vmode, cnt,
+							       info.avl));
+    }
+
+  emit_nonvlmax_insn (code_for_pred_broadcast (info.vmode),
+		      riscv_vector::UNARY_OP, broadcast_ops, cnt);
+
+  if (info.need_loop)
+    {
+      label = gen_label_rtx ();
+
+      emit_label (label);
+      emit_insn (riscv_vector::gen_no_side_effects_vsetvl_rtx (info.vmode, cnt,
+							       info.avl));
+    }
+
+  emit_insn (gen_pred_store (info.vmode, dst, mask, fill_value, cnt,
+			     get_avl_type_rtx (riscv_vector::NONVLMAX)));
+
+  if (info.need_loop)
+    {
+      emit_insn (gen_rtx_SET (dst_addr, gen_rtx_PLUS (Pmode, dst_addr, cnt)));
+      emit_insn (gen_rtx_SET (info.avl, gen_rtx_MINUS (Pmode, info.avl, cnt)));
+
+      /* Emit the loop condition.  */
+      rtx test = gen_rtx_NE (VOIDmode, info.avl, const0_rtx);
+      emit_jump_insn (gen_cbranch4 (Pmode, test, info.avl, const0_rtx, label));
+      emit_insn (gen_nop ());
     }
 
   return true;
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c
index 22844ff348c..32d85ea4f14 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-1.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-add-options riscv_v } */
-/* { dg-additional-options "-O3 -mrvv-max-lmul=dynamic" } */
+/* { dg-additional-options "-O1 -mrvv-max-lmul=dynamic" } */
 /* { dg-final { check-function-bodies "**" "" } } */
 
 #define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
@@ -91,13 +91,42 @@ f6 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES * 8);
 }
-/* Don't vectorise if the move is too large for one operation.
+/* Vectorise with loop for larger lengths
 ** f7:
-** li\s+a2,\d+
-** tail\s+memset
+** mv\s+[ta][0-7],a0
+** li\s+[ta][0-7],129
+** vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+** vmv.v.x\s+v8,a1
+** \.L\d+:
+** vsetvli\s+[ta][0-7],[ta][0-7],e8,m8,ta,ma
+** vse8.v\s+v8,0\(a[0-9]\)
+** add\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** sub\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** bne\s+[ta][0-7],zero,\.L\d+
+** ret
 */
 void *
 f7 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES * 8 + 1);
 }
+
+/* Vectorize with loop for unknown length.
+** f8:
+** mv\s+[ta][0-7],a0
+** mv\s+[ta][0-7],a2
+** vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+** vmv.v.x\s+v8,a1
+** \.L\d+:
+** vsetvli\s+[ta][0-7],[ta][0-7],e8,m8,ta,ma
+** vse8.v\s+v8,0\(a[0-9]\)
+** add\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** sub\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** bne\s+[ta][0-7],zero,\.L\d+
+** ret
+*/
+void *
+f8 (void *a, int const b, int n)
+{
+  return __builtin_memset (a, b, n);
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
index faea442a4bd..9da1c9309d8 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-2.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-add-options riscv_v } */
-/* { dg-additional-options "-O3 -mrvv-max-lmul=m1" } */
+/* { dg-additional-options "-O1 -mrvv-max-lmul=m1" } */
 /* { dg-final { check-function-bodies "**" "" } } */
 
 #define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
@@ -39,13 +39,42 @@ f2 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES);
 }
-/* Don't vectorise if the move is too large for requested lmul.
+/* Vectorise with loop for larger lengths
 ** f3:
-** li\s+a2,\d+
-** tail\s+memset
+** mv\s+[ta][0-7],a0
+** li\s+[ta][0-7],17
+** vsetvli\s+zero,[ta][0-7],e8,m1,ta,ma
+** vmv.v.x\s+v1,a1
+** \.L\d+:
+** vsetvli\s+[ta][0-7],[ta][0-7],e8,m1,ta,ma
+** vse8.v\s+v1,0\(a[0-9]\)
+** add\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** sub\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** bne\s+[ta][0-7],zero,\.L\d+
+** ret
 */
 void *
 f3 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES + 1);
 }
+
+/* Vectorize with loop for unknown length.
+** f4:
+** mv\s+[ta][0-7],a0
+** mv\s+[ta][0-7],a2
+** vsetvli\s+zero,[ta][0-7],e8,m1,ta,ma
+** vmv.v.x\s+v1,a1
+** \.L\d+:
+** vsetvli\s+[ta][0-7],[ta][0-7],e8,m1,ta,ma
+** vse8.v\s+v1,0\(a[0-9]\)
+** add\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** sub\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** bne\s+[ta][0-7],zero,\.L\d+
+** ret
+*/
+void *
+f4 (void *a, int const b, int n)
+{
+  return __builtin_memset (a, b, n);
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c
index 25be694d248..2111a139ad4 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/setmem-3.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-add-options riscv_v } */
-/* { dg-additional-options "-O3 -mrvv-max-lmul=m8" } */
+/* { dg-additional-options "-O1 -mrvv-max-lmul=m8" } */
 /* { dg-final { check-function-bodies "**" "" } } */
 
 #define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
@@ -21,13 +21,13 @@ f1 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES - 1);
 }
-/* Vectorise+inline minimum vector register width using requested lmul.
+/* Vectorised code should use smallest lmul known to fit length.
 ** f2:
 ** (
-** vsetivli\s+zero,\d+,e8,m8,ta,ma
+** vsetivli\s+zero,\d+,e8,m1,ta,ma
 ** |
 ** li\s+a\d+,\d+
-** vsetvli\s+zero,a\d+,e8,m8,ta,ma
+** vsetvli\s+zero,a\d+,e8,m1,ta,ma
 ** )
 ** vmv\.v\.x\s+v\d+,a1
 ** vse8\.v\s+v\d+,0\(a0\)
@@ -57,13 +57,40 @@ f3 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES * 8);
 }
-/* Don't vectorise if the move is too large for requested lmul.
+/* Vectorise with loop for larger lengths
 ** f4:
-** li\s+a2,\d+
-** tail\s+memset
+** mv\s+[ta][0-7],a0
+** li\s+[ta][0-7],129
+** vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+** vmv.v.x\s+v8,a1
+** vsetvli\s+[ta][0-7],[ta][0-7],e8,m8,ta,ma
+** vse8.v\s+v8,0\(a[0-9]\)
+** add\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** sub\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** bne\s+[ta][0-7],zero,\.L\d+
+** ret
 */
 void *
 f4 (void *a, int const b)
 {
   return __builtin_memset (a, b, MIN_VECTOR_BYTES * 8 + 1);
 }
+
+/* Vectorize with loop for unknown length.
+** f5:
+** mv\s+[ta][0-7],a0
+** mv\s+[ta][0-7],a2
+** vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+** vmv.v.x\s+v8,a1
+** vsetvli\s+[ta][0-7],[ta][0-7],e8,m8,ta,ma
+** vse8.v\s+v8,0\(a[0-9]\)
+** add\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** sub\s+[ta][0-7],[ta][0-7],[ta][0-7]
+** bne\s+[ta][0-7],zero,\.L\d+
+** ret
+*/
+void *
+f5 (void *a, int const b, int n)
+{
+  return __builtin_memset (a, b, n);
+}