From patchwork Thu May 13 09:23:07 2021
X-Patchwork-Submitter: Hongtao Liu
X-Patchwork-Id: 1477973
Date: Thu, 13 May 2021 17:23:07 +0800
Subject: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper
 will kill sse registers.
 [PR target/82735]
From: Hongtao Liu
Reply-To: Hongtao Liu
To: GCC Patches
Cc: Jakub Jelinek

Hi:
  When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERs or SETs before LRA, which
leads to incorrect optimization in pass_reload.  To solve this problem,
this patch introduces a pre-reload splitter which adds CLOBBERs to
vzeroupper's pattern; that fixes the wrong code reported in the PR.  At
the same time, so that post-reload CSE can still optimize the low 128
bits across a vzeroupper, the patch transforms those CLOBBERs into
self-SETs in pass_vzeroupper.

  This works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved: even if there are no other uses of xmm6-xmm15 in the
function, vzeroupper's pattern makes pro_epilog save and restore those
registers, which is obviously redundant.  To eliminate this redundancy,
a post-reload splitter is introduced which drops those SETs, and an
epilogue_completed splitter adds them back.  This looks safe since
there's no CSE between the post-reload split2 and the epilogue_completed
split3???  The frame info also needs to be updated in pro_epilog, which
now saves and restores xmm6-xmm15 only if they have uses other than the
explicit vzeroupper pattern.

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Count
	number of __builtin_ia32_vzeroupper calls.
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
	Transform CLOBBERs to SETs for the explicit vzeroupper pattern
	so that CSE can optimize the lower 128 bits.
	* config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
	New.
	(ix86_save_reg): If there's no use of xmm6~xmm15 other than an
	explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
	REGNO.
	(ix86_finalize_stack_frame_flags): Recompute frame layout if
	there's an explicit vzeroupper under TARGET_64BIT_MS_ABI.
	* config/i386/i386.h (struct machine_function): Change type of
	has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
	* config/i386/sse.md (*avx_vzeroupper_2): New post-reload
	splitter which drops all SETs for explicit vzeroupper patterns.
	(*avx_vzeroupper_1): Generate SET reg to reg instead of CLOBBER,
	and add a pre-reload splitter after it.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
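To make the failure mode concrete, here is a condensed form of the
pr82735-1.c test added by the patch below (compile with -O2 -mavx); the
inline comment is my annotation of the failure, not part of the testcase:

#include <immintrin.h>

void
__attribute__ ((noipa))
mtest (char *dest)
{
  __m256i ymm1 = _mm256_set1_epi8 ((char) 0x1);
  _mm256_storeu_si256 ((__m256i *) (dest + 32), ymm1);
  /* Without CLOBBERs on the vzeroupper pattern, post-reload optimizers
     may reuse ymm1's register for ymm2 across this explicit vzeroupper,
     even though its upper 128 bits have been zeroed, so dest[16..31]
     would read back 0 instead of 1.  */
  _mm256_zeroupper ();
  __m256i ymm2 = _mm256_set1_epi8 ((char) 0x1);
  _mm256_storeu_si256 ((__m256i *) dest, ymm2);
}

The TARGET_64BIT_MS_ABI side of the change is exercised analogously by
pr82735-3.c below, which checks that no xmm6-xmm15 saves and restores
remain when the only use of those registers is an explicit vzeroupper.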
From d53b0c6934ea499c9f87df963661b627e7e977bf Mon Sep 17 00:00:00 2001
From: liuhongt
Date: Wed, 12 May 2021 14:20:54 +0800
Subject: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper
 will kill sse registers.

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERs or SETs before LRA, which
leads to incorrect optimization in pass_reload.  To solve this problem,
this patch introduces a pre-reload splitter which adds CLOBBERs to
vzeroupper's pattern; that fixes the wrong code reported in the PR.  At
the same time, so that post-reload CSE can still optimize the low 128
bits across a vzeroupper, the patch transforms those CLOBBERs into
self-SETs in pass_vzeroupper.

This works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved: even if there are no other uses of xmm6-xmm15 in the
function, vzeroupper's pattern makes pro_epilog save and restore those
registers, which is obviously redundant.  To eliminate this redundancy,
a post-reload splitter is introduced which drops those SETs, and an
epilogue_completed splitter adds them back.  This looks safe since
there's no CSE between the post-reload split2 and the epilogue_completed
split3???  The frame info also needs to be updated in pro_epilog, which
now saves and restores xmm6-xmm15 only if they have uses other than the
explicit vzeroupper pattern.

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Count
	number of __builtin_ia32_vzeroupper calls.
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
	Transform CLOBBERs to SETs for the explicit vzeroupper pattern
	so that CSE can optimize the lower 128 bits.
	* config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
	New.
	(ix86_save_reg): If there's no use of xmm6~xmm15 other than an
	explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
	REGNO.
	(ix86_finalize_stack_frame_flags): Recompute frame layout if
	there's an explicit vzeroupper under TARGET_64BIT_MS_ABI.
	* config/i386/i386.h (struct machine_function): Change type of
	has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
	* config/i386/sse.md (*avx_vzeroupper_2): New post-reload
	splitter which drops all SETs for explicit vzeroupper patterns.
	(*avx_vzeroupper_1): Generate SET reg to reg instead of CLOBBER,
	and add a pre-reload splitter after it.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
---
 gcc/config/i386/i386-expand.c             |  2 +-
 gcc/config/i386/i386-features.c           | 25 ++++++++++-
 gcc/config/i386/i386.c                    | 23 ++++++++++
 gcc/config/i386/i386.h                    |  8 ++--
 gcc/config/i386/sse.md                    | 48 +++++++++++++++++++-
 gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 ++++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-2.c | 21 +++++++++
 gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 +++
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 ++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++++++++++++
 10 files changed, 256 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index fee4d07b7fd..7f3326a12b2 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13233,7 +13233,7 @@ rdseed_step:
       return 0;
 
     case IX86_BUILTIN_VZEROUPPER:
-      cfun->machine->has_explicit_vzeroupper = true;
+      cfun->machine->has_explicit_vzeroupper++;
       break;
 
     default:
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 77783a154b6..6b2179f16cb 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1827,8 +1827,31 @@ ix86_add_reg_usage_to_vzerouppers (void)
 	{
 	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
+	  /* Transform CLOBBERs to SETs so that lower 128 bits of sse
+	     registers will be able to cross vzeroupper in post-reload CSE.  */
 	  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
-	    ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+	    {
+	      if (XVECEXP (XVECEXP (PATTERN (insn), 0, 0), 0, 0) == const1_rtx)
+		{
+		  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+		  rtvec vec = rtvec_alloc (nregs + 1);
+		  RTVEC_ELT (vec, 0) = XVECEXP (PATTERN (insn), 0, 0);
+		  for (unsigned int i = 0; i < nregs; ++i)
+		    {
+		      unsigned int regno = GET_SSE_REGNO (i);
+		      rtx reg = gen_rtx_REG (V2DImode, regno);
+		      RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
+		    }
+		  XVEC (PATTERN (insn), 0) = vec;
+		  INSN_CODE (insn) = -1;
+		  df_insn_rescan (insn);
+		}
+	      else
+		{
+		  gcc_assert (XVECLEN (PATTERN (insn), 0) == 1);
+		  ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+		}
+	    }
 	  df_simulate_one_insn_backwards (bb, insn, live_regs);
 	}
     }
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 780da108a7c..4d4d7dbbc82 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -6170,6 +6170,17 @@ ix86_hard_regno_scratch_ok (unsigned int regno)
 	      && df_regs_ever_live_p (regno)));
 }
 
+/* Return true if explicit usage of __builtin_ia32_vzeroupper
+   should be specially handled in pro_epilog.  */
+static bool
+ix86_handle_explicit_vzeroupper_in_pro_epilog ()
+{
+  return (cfun->machine->has_explicit_vzeroupper
+	  && TARGET_64BIT_MS_ABI
+	  && !epilogue_completed
+	  && reload_completed);
+}
+
 /* Return TRUE if we need to save REGNO.  */
 
 bool
@@ -6244,6 +6255,16 @@ ix86_save_reg (unsigned int regno, bool maybe_eh_return, bool ignore_outlined)
       && !cfun->machine->no_drap_save_restore)
     return true;
 
+  /* If there's no use other than explicit vzeroupper
+     for xmm6~xmm15 under TARGET_64BIT_MS_ABI,
+     no need to save REGNO.  */
+  if (ix86_handle_explicit_vzeroupper_in_pro_epilog ()
+      && (IN_RANGE (regno, FIRST_SSE_REG + 6, LAST_SSE_REG)
+	  || IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
+    return df_regs_ever_live_p (regno)
+	   ? df_hard_reg_used_count (regno) > cfun->machine->has_explicit_vzeroupper
+	   : false;
+
   return (df_regs_ever_live_p (regno)
 	  && !call_used_or_fixed_reg_p (regno)
 	  && (regno != HARD_FRAME_POINTER_REGNUM || !frame_pointer_needed));
@@ -8046,6 +8067,8 @@ ix86_finalize_stack_frame_flags (void)
       recompute_frame_layout_p = true;
   crtl->stack_realign_needed = stack_realign;
   crtl->stack_realign_finalized = true;
+  if (ix86_handle_explicit_vzeroupper_in_pro_epilog ())
+    recompute_frame_layout_p = true;
   if (recompute_frame_layout_p)
     ix86_compute_frame_layout ();
 }
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 97d6f3863cb..c0855a936ac 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2654,10 +2654,6 @@ struct GTY(()) machine_function {
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
-  /* True if __builtin_ia32_vzeroupper () has been expanded in current
-     function.  */
-  BOOL_BITFIELD has_explicit_vzeroupper : 1;
-
   /* True if we should act silently, rather than raise an error for
      invalid calls.  */
   BOOL_BITFIELD silent_p : 1;
@@ -2665,6 +2661,10 @@ struct GTY(()) machine_function {
   /* The largest alignment, in bytes, of stack slot actually used.  */
   unsigned int max_used_stack_alignment;
 
+  /* Number of __builtin_ia32_vzeroupper () which has been expanded in
+     current function.  */
+  unsigned int has_explicit_vzeroupper;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 897cf3eaea9..489fa02fa20 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -20626,7 +20626,7 @@ (define_insn_and_split "*avx_vzeroupper_1"
       else
 	{
 	  rtx reg = gen_rtx_REG (V2DImode, regno);
-	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	  RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
 	}
     }
   operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
@@ -20638,6 +20638,52 @@ (define_insn_and_split "*avx_vzeroupper_1"
   (set_attr "btver2_decode" "vector")
   (set_attr "mode" "OI")])
 
+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explicitly used, for LRA purpose, make it clear
+     the instruction kills sse registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+						gen_rtvec (1, const1_rtx),
+						UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})
+
+(define_insn_and_split "*avx_vzeroupper_2"
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
+  "vzeroupper"
+  "&& reload_completed && TARGET_64BIT_MS_ABI"
+  [(const_int 0)]
+{
+  /* To avoid redundant save and restore in pro_and_epilog, drop
+     those SETs/CLOBBERs which are added by the pre-reload splitter
+     or pass_vzeroupper; it's safe since there's no CSE optimization
+     between post-reload split2 and epilogue-completed split3???  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  emit_insn (gen_avx_vzeroupper ());
+  DONE;
+}
+  [(set_attr "type" "sse")
+   (set_attr "modrm" "0")
+   (set_attr "memory" "none")
+   (set_attr "prefix" "vex")
+   (set_attr "btver2_decode" "vector")
+   (set_attr "mode" "OI")])
+
 (define_mode_attr pbroadcast_evex_isa
   [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
    (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
new file mode 100644
index 00000000000..1a63b9ae9c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-require-effective-target avx } */
+
+#include "avx-check.h"
+
+void
+__attribute__ ((noipa))
+mtest(char *dest)
+{
+  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
+  _mm256_zeroupper();
+  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)dest, ymm2);
+}
+
+void
+avx_test ()
+{
+  char buf[64];
+  for (int i = 0; i != 64; i++)
+    buf[i] = 2;
+  mtest (buf);
+
+  for (int i = 0; i < 32; ++i)
+    if (buf[i] != 1)
+      __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
new file mode 100644
index 00000000000..48d0d6e983d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2" } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  /* xmm1 can be propagated to xmm2 by CSE.  */
+  __m128i xmm1 = _mm_set1_epi8((char)0x1);
+  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  __m128i xmm2 = _mm_set1_epi8((char)0x1);
+  _mm_storeu_si128((__m128i *)dest, xmm2);
+}
+
+/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
+/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
new file mode 100644
index 00000000000..e3f801e6924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2 -mabi=ms" } */
+/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
+
+#include "pr82735-2.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
new file mode 100644
index 00000000000..78c0a6cb2c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -0,0 +1,48 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
new file mode 100644
index 00000000000..2a58cbe52d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -0,0 +1,54 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
-- 
2.18.1