From patchwork Wed Sep 11 02:16:37 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: liuhongt
X-Patchwork-Id: 1983606
From: liuhongt
To: gcc-patches@gcc.gnu.org
Cc: crazylht@gmail.com,
 hjl.tools@gmail.com
Subject: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.
Date: Wed, 11 Sep 2024 10:16:37 +0800
Message-Id: <20240911021637.3759883-1-hongtao.liu@intel.com>
X-Mailer: git-send-email 2.31.1

GCC 12 enabled vectorization at -O2 with the very cheap cost model,
which is restricted to loops with a constant trip count.  Its
vectorization capability is deliberately limited to keep the code
size impact small.

This patch extends the very cheap cost model slightly to support
variable trip counts.  It still disables peeling for gaps/alignment,
runtime alias checking and epilogue vectorization, out of
consideration for code size.  So there are at most 2 versions of a
loop for -O2 vectorization: one vectorized main loop and one scalar
remainder loop.

For example,

void
foo1 (int* __restrict a, int* b, int* c, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i] + c[i];
}

with -O2 -march=x86-64-v3, is vectorized to

.L10:
        vmovdqu (%r8,%rax), %ymm0
        vpaddd  (%rsi,%rax), %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L10
        movl    %ecx, %eax
        andl    $-8, %eax
        cmpl    %eax, %ecx
        je      .L21
        vzeroupper
.L12:
        movl    (%r8,%rax,4), %edx
        addl    (%rsi,%rax,4), %edx
        movl    %edx, (%rdi,%rax,4)
        addq    $1, %rax
        cmpl    %eax, %ecx
        jne     .L12

As measured with SPEC2017 on EMR, the patch (N-Iter) improves
performance by 4.11% with an extra 2.80% code size, and the cheap cost
model improves performance by 5.74% with an extra 8.88% code size.
The details are below.

Performance measured with -march=x86-64-v3 -O2 on EMR

                    N-Iter      cheap cost model
500.perlbench_r     -0.12%      -0.12%
502.gcc_r            0.44%      -0.11%
505.mcf_r            0.17%       4.46%
520.omnetpp_r        0.28%      -0.27%
523.xalancbmk_r      0.00%       5.93%
525.x264_r          -0.09%      23.53%
531.deepsjeng_r      0.19%       0.00%
541.leela_r          0.22%       0.00%
548.exchange2_r    -11.54%     -22.34%
557.xz_r             0.74%       0.49%
GEOMEAN INT         -1.04%       0.60%

503.bwaves_r         3.13%       4.72%
507.cactuBSSN_r      1.17%       0.29%
508.namd_r           0.39%       6.87%
510.parest_r         3.14%       8.52%
511.povray_r         0.10%      -0.20%
519.lbm_r           -0.68%      10.14%
521.wrf_r           68.20%      76.73%
526.blender_r        0.12%       0.12%
527.cam4_r          19.67%      23.21%
538.imagick_r        0.12%       0.24%
544.nab_r            0.63%       0.53%
549.fotonik3d_r     14.44%       9.43%
554.roms_r          12.39%       0.00%
GEOMEAN FP           8.26%       9.41%
GEOMEAN ALL          4.11%       5.74%

Code size impact

                    N-Iter      cheap cost model
500.perlbench_r      0.22%       1.03%
502.gcc_r            0.25%       0.60%
505.mcf_r            0.00%      32.07%
520.omnetpp_r        0.09%       0.31%
523.xalancbmk_r      0.08%       1.86%
525.x264_r           0.75%       7.96%
531.deepsjeng_r      0.72%       3.28%
541.leela_r          0.18%       0.75%
548.exchange2_r      8.29%      12.19%
557.xz_r             0.40%       0.60%
GEOMEAN INT          1.07%       5.71%

503.bwaves_r        12.89%      21.59%
507.cactuBSSN_r      0.90%      20.19%
508.namd_r           0.77%      14.75%
510.parest_r         0.91%       3.91%
511.povray_r         0.45%       4.08%
519.lbm_r            0.00%       0.00%
521.wrf_r            5.97%      12.79%
526.blender_r        0.49%       3.84%
527.cam4_r           1.39%       3.28%
538.imagick_r        1.86%       7.78%
544.nab_r            0.41%       3.00%
549.fotonik3d_r     25.50%      47.47%
554.roms_r           5.17%      13.01%
GEOMEAN FP           4.14%      11.38%
GEOMEAN ALL          2.80%       8.88%

The only notable regression is from 548.exchange2_r: vectorizing the
inner loop in each layer of the 9-layer loop nest increases register
pressure and causes more spills.

- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
- block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
.....
- block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
...
- block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10

It looks like aarch64 doesn't have this issue because it has 32 GPRs,
while x86 has only 16.  I have an extra patch that prevents loop
vectorization in deeply nested loops for the x86 backend, which
brings the performance back.

For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost
model increases code size a lot but doesn't improve performance, and
N-Iter is much better for code size on these benchmarks.

Any comments?

gcc/ChangeLog:

	* tree-vect-loop.cc (vect_analyze_loop_costing): Enable
	vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
	cost model.
	(vect_analyze_loop): Disable epilogue vectorization in very
	cheap cost model.
---
 gcc/tree-vect-loop.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 242d5e2d916..06afd8cae79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
      a copy of the scalar code (even if we might be able to vectorize it).  */
   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
                      /* No code motion support for multiple epilogues so for now
                         not supported when multiple exits.  */
                      && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
-                     && !loop->simduid);
+                     && !loop->simduid
+                     && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
 
   if (!vect_epilogues)
     return first_loop_vinfo;
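
For illustration only, the two-loop shape described in the commit
message (one vectorized main loop plus one scalar remainder loop) can
be modeled in plain C.  This is a hand-written sketch, not compiler
output: foo1_scalar, foo1_shape and VF are hypothetical names, VF = 8
assumes 32-bit int with 256-bit vectors as in the -march=x86-64-v3
example, and n is assumed to be non-negative.

```c
enum { VF = 8 };  /* assumed vectorization factor: 8 ints per 256-bit vector */

/* Reference scalar loop, equivalent to foo1 in the commit message.  */
void
foo1_scalar (int *__restrict a, const int *b, const int *c, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i] + c[i];
}

/* Scalar model of the vectorized shape: a main loop that handles VF
   elements per iteration, followed by a single scalar remainder loop.
   There is no vectorized epilogue and no peeling for alignment.  */
void
foo1_shape (int *__restrict a, const int *b, const int *c, int n)
{
  int i = 0;
  int main_end = n & -VF;  /* n rounded down to a multiple of VF */

  /* Main loop: each iteration stands in for one vector load/add/store.  */
  for (; i != main_end; i += VF)
    for (int lane = 0; lane < VF; lane++)
      a[i + lane] = b[i + lane] + c[i + lane];

  /* Remainder loop: at most VF - 1 scalar iterations.  */
  for (; i != n; i++)
    a[i] = b[i] + c[i];
}
```

Both functions compute the same result for any n >= 0; the second
only makes explicit that the remaining n % VF iterations are handled
by a scalar loop rather than by a second, narrower vector loop.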