From: Hongyu Wang
To: gcc-patches@gcc.gnu.org
Cc: hjl.tools@gmail.com, hongtao.liu@intel.com
Subject: [PATCH] i386: Sync move_max/store_max with prefer-vector-width [PR112824]
Date: Thu, 14 Dec 2023 15:54:02 +0800

Hi,

Currently, move_max and store_max are determined by the tuning features
first. Ideally they should follow -mprefer-vector-width when that option
is explicitly set, so that vector moves and vector operations use the
same vector size.

Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}.

OK for trunk?

gcc/ChangeLog:

	PR target/112824
	* config/i386/i386-options.cc (ix86_option_override_internal):
	Sync ix86_move_max/ix86_store_max with prefer_vector_width when
	it is explicitly set.

gcc/testsuite/ChangeLog:

	PR target/112824
	* gcc.target/i386/pieces-memset-45.c: Remove
	-mprefer-vector-width=256.
	* g++.target/i386/pr112824-1.C: New test.
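To illustrate the intended precedence, here is a minimal standalone C++ sketch of the move_max selection order after this patch. It is not GCC code: `Pvw`, `TuneFlags` and `pick_move_max` are hypothetical stand-ins for `prefer_vector_width_type`, the `X86_TUNE_*_MOVE_BY_PIECES` tuning features and the logic in ix86_option_override_internal, and the final 128-bit fallback is a simplification, since that branch lies outside the hunk shown. store_max follows the same pattern with the `*_STORE_BY_PIECES` features.

```cpp
// Hypothetical model of the move_max selection order after this patch;
// not GCC code.  Pvw stands in for GCC's prefer_vector_width enum.
enum class Pvw { None, Avx128, Avx256, Avx512 };

// Hypothetical stand-ins for the AVX512/AVX256 move-by-pieces tunings.
struct TuneFlags {
  bool avx512_move_by_pieces;
  bool avx256_move_by_pieces;
};

// explicit_pvw models -mprefer-vector-width= (Pvw::None when not given).
constexpr Pvw pick_move_max(Pvw explicit_pvw, TuneFlags tune) {
  // New first check: an explicitly set preferred vector width wins.
  if (explicit_pvw != Pvw::None)
    return explicit_pvw;
  // Otherwise the tuning features decide, as before the patch.
  if (tune.avx512_move_by_pieces)
    return Pvw::Avx512;
  if (tune.avx256_move_by_pieces)
    return Pvw::Avx256;
  // Simplified fallback (assumption; the real branch is not in the hunk).
  return Pvw::Avx128;
}

// Tuning would pick 256-bit moves, but 512 was requested explicitly.
static_assert(pick_move_max(Pvw::Avx512, TuneFlags{false, true}) == Pvw::Avx512);
// Without an explicit width, the tuning feature still decides.
static_assert(pick_move_max(Pvw::None, TuneFlags{false, true}) == Pvw::Avx256);
```

By the same logic, an explicit -mprefer-vector-width=256 now caps the width even when 512-bit by-pieces tuning is enabled, which is presumably why pieces-memset-45.c drops that option.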
---
 gcc/config/i386/i386-options.cc               |   8 +-
 gcc/testsuite/g++.target/i386/pr112824-1.C    | 113 ++++++++++++++++++
 .../gcc.target/i386/pieces-memset-45.c        |   2 +-
 3 files changed, 120 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/i386/pr112824-1.C

diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 588a0878c0d..440ef59ffff 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -3012,7 +3012,9 @@ ix86_option_override_internal (bool main_args_p,
     {
       /* Set the maximum number of bits can be moved from memory to
          memory efficiently.  */
-      if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
+      if (opts_set->x_prefer_vector_width_type != PVW_NONE)
+        opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
+      else if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
         opts->x_ix86_move_max = PVW_AVX512;
       else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
         opts->x_ix86_move_max = PVW_AVX256;
@@ -3034,7 +3036,9 @@ ix86_option_override_internal (bool main_args_p,
     {
       /* Set the maximum number of bits can be stored to memory
          efficiently.  */
-      if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
+      if (opts_set->x_prefer_vector_width_type != PVW_NONE)
+        opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
+      else if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
         opts->x_ix86_store_max = PVW_AVX512;
       else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
         opts->x_ix86_store_max = PVW_AVX256;
diff --git a/gcc/testsuite/g++.target/i386/pr112824-1.C b/gcc/testsuite/g++.target/i386/pr112824-1.C
new file mode 100644
index 00000000000..fccaf23c530
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr112824-1.C
@@ -0,0 +1,113 @@
+/* PR target/112824 */
+/* { dg-do compile } */
+/* { dg-options "-std=c++23 -O3 -march=skylake-avx512 -mprefer-vector-width=512" } */
+/* { dg-final { scan-assembler-not "vmov(?:dqu|apd)\[ \\t\]+\[^\n\]*%ymm" } } */
+
+
+#include
+#include
+#include
+#include
+
+template
+using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
+
+// Omitted: 16 without AVX, 32 without AVX512F,
+// or for forward compatibility some AVX10 may also mean 32-only
+static constexpr ptrdiff_t VectorBytes = 64;
+template
+static constexpr ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64/sizeof(T);
+
+template struct Vector{
+  static constexpr ptrdiff_t L = N;
+  T data[L];
+  static constexpr auto size()->ptrdiff_t{return N;}
+};
+template struct Vector{
+  static constexpr ptrdiff_t W = N >= VecWidth ? VecWidth : ptrdiff_t(std::bit_ceil(size_t(N)));
+  static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0);
+  using V = Vec;
+  V data[L];
+  static constexpr auto size()->ptrdiff_t{return N;}
+};
+/// should be trivially copyable
+/// codegen is worse when passing by value, even though it seems like it should make
+/// aliasing simpler to analyze?
+template
+[[gnu::always_inline]] constexpr auto operator+(Vector x, Vector y) -> Vector {
+  Vector z;
+  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] + y.data[n];
+  return z;
+}
+template
+[[gnu::always_inline]] constexpr auto operator*(Vector x, Vector y) -> Vector {
+  Vector z;
+  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] * y.data[n];
+  return z;
+}
+template
+[[gnu::always_inline]] constexpr auto operator+(T x, Vector y) -> Vector {
+  Vector z;
+  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x + y.data[n];
+  return z;
+}
+template
+[[gnu::always_inline]] constexpr auto operator*(T x, Vector y) -> Vector {
+  Vector z;
+  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x * y.data[n];
+  return z;
+}
+
+
+
+template struct Dual {
+  T value;
+  Vector partials;
+};
+// Here we have a specialization for non-power-of-2 `N`
+template
+requires(std::floating_point && (std::popcount(size_t(N))>1))
+struct Dual {
+  Vector data;
+};
+
+template
+consteval auto firstoff(){
+  static_assert(std::same_as, "type not implemented");
+  if constexpr (W==2) return Vec<2,int64_t>{0,1} != 0;
+  else if constexpr (W == 4) return Vec<4,int64_t>{0,1,2,3} != 0;
+  else if constexpr (W == 8) return Vec<8,int64_t>{0,1,2,3,4,5,6,7} != 0;
+  else static_assert(false, "vector width not implemented");
+}
+
+template
+[[gnu::always_inline]] constexpr auto operator+(Dual a,
+                                                Dual b)
+  -> Dual {
+  if constexpr (std::floating_point && (std::popcount(size_t(N))>1)){
+    Dual c;
+    for (ptrdiff_t l = 0; l < Vector::L; ++l)
+      c.data.data[l] = a.data.data[l] + b.data.data[l];
+    return c;
+  } else return {a.value + b.value, a.partials + b.partials};
+}
+
+template
+[[gnu::always_inline]] constexpr auto operator*(Dual a,
+                                                Dual b)
+  -> Dual {
+  if constexpr (std::floating_point && (std::popcount(size_t(N))>1)){
+    using V = typename Vector::V;
+    V va = V{}+a.data.data[0][0], vb = V{}+b.data.data[0][0];
+    V x = va * b.data.data[0];
+    Dual c;
+    c.data.data[0] = firstoff::W,T>() ? x + vb*a.data.data[0] : x;
+    for (ptrdiff_t l = 1; l < Vector::L; ++l)
+      c.data.data[l] = va*b.data.data[l] + vb*a.data.data[l];
+    return c;
+  } else return {a.value * b.value, a.value * b.partials + b.value * a.partials};
+}
+
+void prod(Dual,2> &c, const Dual,2> &a, const Dual,2>&b){
+  c = a*b;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
index 70c80e5064b..e8ce7c23256 100644
--- a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
+/* { dg-options "-O2 -march=x86-64 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
 
 extern char *dst;