From patchwork Wed Sep 11 02:16:37 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: liuhongt <hongtao.liu@intel.com>
X-Patchwork-Id: 1983606
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (2048-bit key;
 unprotected) header.d=intel.com header.i=@intel.com header.a=rsa-sha256
 header.s=Intel header.b=lqIFGR6N;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4X3PNC1pL1z1y1y
	for <incoming@patchwork.ozlabs.org>; Wed, 11 Sep 2024 12:17:13 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id C80CC3857C4F
	for <incoming@patchwork.ozlabs.org>; Wed, 11 Sep 2024 02:17:10 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.21])
 by sourceware.org (Postfix) with ESMTPS id 1297C3858C41
 for <gcc-patches@gcc.gnu.org>; Wed, 11 Sep 2024 02:16:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 1297C3858C41
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=intel.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 1297C3858C41
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=198.175.65.21
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1726021006; cv=none;
 b=LrnqBn0SeREsblFJKgHIsIkC1qqCYgW2P3TvZb50lg0CECkhrKcULiqPrc6pkV3pBqHlo5CRAHI1rzvSaWpT/pW+04w7lP/P/GdgbjpYR1mdbD/G282cFj58N4b5f2ihL8ox0FA+VMlm0Uyttlpz+Sao0IZRqhNrvB8zcBjZGrA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1726021006; c=relaxed/simple;
 bh=QGd3IJ+jmK3bAhymfT9wko+lTLXrtPW2WG1gjYzf7X0=;
 h=DKIM-Signature:From:To:Subject:Date:Message-Id:MIME-Version;
 b=aexKbEhuXPstrZbUZrT1EA0T0XIzUXziKkmouliA/8sHi8e97Dl+OtogT9HSdySHozXPcJLMb+geQ7SPsiDtTXwyUmmbMA/UOhGtVkCtMpLvfzf18wc5MmGMuT8kGoW2di5fSD9Xkchv0ISk0l1oGLy9QH0tzR5hlRzkCq45BUU=
ARC-Authentication-Results: i=1; server2.sourceware.org
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1726021002; x=1757557002;
 h=from:to:cc:subject:date:message-id:mime-version:
 content-transfer-encoding;
 bh=QGd3IJ+jmK3bAhymfT9wko+lTLXrtPW2WG1gjYzf7X0=;
 b=lqIFGR6Nlpvbn8Lz6TYcqgqpDA/joj8HuwLZnlaB4VoLYGdGwEtcYH8v
 uPleXuLDbMP/jMrCqglenuUFZ9tK2ViCRslPTOO57UImRWFw+cXEYBULJ
 Nv3Sy4ojPvp+i9JiMW3EqxK7bSE+MCmLTl+12ElbfrdiG6+EMWaKEf17r
 08HnapDqEdiqatlcs8HgFFPYrHnRIjK0r/lbKU9Jrzjapja4XwdxWnFP2
 tgwi9utPIg3iS42moac8ue9gtYTwxMkHXpJ7k0dyLv0PyngSa7IBeTQeT
 NA4f2PaFBd4PUC+nWgGtu1aOYfI5px9p8rStSnNPPMAA+m8720Ibwo/FT g==;
X-CSE-ConnectionGUID: iQPZydz2SC6xlmKaX0ZJXg==
X-CSE-MsgGUID: SBFr7FS0TWamOM032xeANg==
X-IronPort-AV: E=McAfee;i="6700,10204,11191"; a="24738712"
X-IronPort-AV: E=Sophos;i="6.10,218,1719903600"; d="scan'208";a="24738712"
Received: from orviesa010.jf.intel.com ([10.64.159.150])
 by orvoesa113.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 10 Sep 2024 19:16:40 -0700
X-CSE-ConnectionGUID: JUZmKLllSuyfbNnHv2R2fw==
X-CSE-MsgGUID: YJ8b5BLmSv6sHBxWhIIFsw==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.10,218,1719903600"; d="scan'208";a="67064128"
Received: from shliclel4217.sh.intel.com ([10.239.240.127])
 by orviesa010.jf.intel.com with ESMTP; 10 Sep 2024 19:16:38 -0700
From: liuhongt <hongtao.liu@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: crazylht@gmail.com,
	hjl.tools@gmail.com
Subject: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap
 cost model but disable epilog vectorization.
Date: Wed, 11 Sep 2024 10:16:37 +0800
Message-Id: <20240911021637.3759883-1-hongtao.liu@intel.com>
X-Mailer: git-send-email 2.31.1
MIME-Version: 1.0
X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org

GCC 12 enables vectorization at -O2 with the very cheap cost model, which is restricted
to loops with a constant trip count. Vectorization capability is deliberately limited there
because of the code size impact.

This patch extends the very cheap cost model slightly to support variable trip counts.
Peeling for gaps/alignment, runtime alias checking and epilogue vectorization remain
disabled to limit code size.
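
As a rough illustration of the gating described above, here is a toy C model.
All names in it (toy_cost_model, toy_loop_info, toy_*_allowed) are made up for
this sketch and are not GCC's data structures; it only mirrors the two
conditions touched by the diff and ignores every other check the vectorizer
performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, ordered from most to least restrictive.  Hypothetical names.  */
enum toy_cost_model { TOY_VERY_CHEAP, TOY_CHEAP, TOY_DYNAMIC };

struct toy_loop_info
{
  bool peeling_for_alignment;
  bool peeling_for_gaps;
  bool peeling_for_niter;	/* A scalar remainder loop is needed.  */
};

/* Before the patch: under the very cheap model any peeling rejects the loop.  */
static bool
toy_vectorize_allowed_before (enum toy_cost_model model,
			      const struct toy_loop_info *info)
{
  assert (info != NULL);
  if (model != TOY_VERY_CHEAP)
    return true;
  return !(info->peeling_for_alignment
	   || info->peeling_for_gaps
	   || info->peeling_for_niter);
}

/* After the patch: peeling for niter alone no longer rejects the loop.  */
static bool
toy_vectorize_allowed_after (enum toy_cost_model model,
			     const struct toy_loop_info *info)
{
  assert (info != NULL);
  if (model != TOY_VERY_CHEAP)
    return true;
  return !(info->peeling_for_alignment || info->peeling_for_gaps);
}

/* After the patch: epilogue vectorization needs more than very cheap.  */
static bool
toy_epilogue_vectorization_allowed (enum toy_cost_model model)
{
  return model > TOY_VERY_CHEAP;
}
```

In this model a loop that only needs a remainder loop is rejected before the
patch and accepted after it, while its remainder loop always stays scalar.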

So there are at most 2 versions of a loop for O2 vectorization: one vectorized main loop
and one scalar remainder loop.

e.g.

void
foo1 (int* __restrict a, int* b, int* c, int n)
{
 for (int i = 0; i != n; i++)
  a[i] = b[i] + c[i];
}

with -O2 -march=x86-64-v3, will be vectorized to

.L10:
        vmovdqu (%r8,%rax), %ymm0
        vpaddd  (%rsi,%rax), %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L10
        movl    %ecx, %eax
        andl    $-8, %eax
        cmpl    %eax, %ecx
        je      .L21
        vzeroupper
.L12:
        movl    (%r8,%rax,4), %edx
        addl    (%rsi,%rax,4), %edx
        movl    %edx, (%rdi,%rax,4)
        addq    $1, %rax
        cmpl    %eax, %ecx
        jne     .L12
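
For readers who prefer C to assembly, the shape of the code above can be
sketched by hand as follows. This is only an approximation of the two-loop
structure, not compiler output: foo1_two_loops and VF are names invented for
the sketch, the inner lane loop stands in for one 256-bit vector operation,
and n is assumed to be non-negative.

```c
#include <assert.h>

enum { VF = 8 };	/* Eight 32-bit ints fit in one 256-bit ymm register.  */

void
foo1_two_loops (int *restrict a, const int *b, const int *c, int n)
{
  assert (n >= 0);

  /* n rounded down to a multiple of VF (compare "andl $-8" above).  */
  int main_end = n & -VF;
  int i = 0;

  /* Main loop: each iteration models one vector load/add/store.  */
  for (; i < main_end; i += VF)
    for (int lane = 0; lane < VF; lane++)
      a[i + lane] = b[i + lane] + c[i + lane];

  /* Scalar remainder loop: at most VF - 1 iterations, never vectorized.  */
  for (; i < n; i++)
    a[i] = b[i] + c[i];
}
```

With n = 19, for example, the main loop covers elements 0..15 in two
iterations and the remainder loop handles the last three.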

As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by 4.11%
with 2.80% extra code size, while the cheap cost model improves performance by 5.74%
with 8.88% extra code size. The details are below.

Performance measured with -march=x86-64-v3 -O2 on EMR

    	     	    N-Iter	cheap cost model
500.perlbench_r	    -0.12%	-0.12%
502.gcc_r	    0.44%	-0.11%	
505.mcf_r	    0.17%	4.46%
520.omnetpp_r	    0.28%	-0.27%
523.xalancbmk_r	    0.00%	5.93%
525.x264_r	    -0.09%	23.53%
531.deepsjeng_r	    0.19%	0.00%
541.leela_r	    0.22%	0.00%
548.exchange2_r	    -11.54%	-22.34%
557.xz_r	    0.74%	0.49%
GEOMEAN INT	    -1.04%	0.60%

503.bwaves_r	    3.13%	4.72%
507.cactuBSSN_r	    1.17%	0.29%
508.namd_r	    0.39%	6.87%
510.parest_r	    3.14%	8.52%
511.povray_r	    0.10%	-0.20%
519.lbm_r	    -0.68%	10.14%
521.wrf_r	    68.20%	76.73%
526.blender_r	    0.12%	0.12%
527.cam4_r	    19.67%	23.21%
538.imagick_r	    0.12%	0.24%
544.nab_r	    0.63%	0.53%
549.fotonik3d_r	    14.44%	9.43%
554.roms_r	    12.39%	0.00%
GEOMEAN FP	    8.26%	9.41%
GEOMEAN ALL	    4.11%	5.74%
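
A note on reading the GEOMEAN rows: the sketch below assumes they are the
geometric mean of the per-benchmark ratios (1 + change), converted back to a
percentage. That method is an assumption and is not stated above; the function
names are made up for the sketch. Under that assumption, the ten N-Iter INT
values should come out close to the -1.04% reported.

```c
#include <assert.h>

/* n-th root of p (p > 0) by Newton iteration, so no libm is needed.  */
static double
nth_root (double p, int n)
{
  double x = 1.0;
  for (int iter = 0; iter < 100; iter++)
    {
      double xn1 = 1.0;		/* x^(n-1) */
      for (int k = 0; k < n - 1; k++)
	xn1 *= x;
      x = ((n - 1) * x + p / xn1) / n;
    }
  return x;
}

/* Geometric mean of percentage changes, returned as a percentage.
   Each entry must be greater than -100.  */
double
geomean_percent (const double *pct, int n)
{
  assert (n > 0);
  double product = 1.0;
  for (int i = 0; i < n; i++)
    product *= 1.0 + pct[i] / 100.0;
  return (nth_root (product, n) - 1.0) * 100.0;
}
```

Because the mean is taken over ratios, one large regression such as
548.exchange2_r outweighs several small gains, which is why GEOMEAN INT is
negative for N-Iter even though most INT benchmarks improve slightly.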

Code size impact
    	     	    N-Iter	cheap cost model
500.perlbench_r	    0.22%	1.03%
502.gcc_r	    0.25%	0.60%	
505.mcf_r	    0.00%	32.07%
520.omnetpp_r	    0.09%	0.31%
523.xalancbmk_r	    0.08%	1.86%
525.x264_r	    0.75%	7.96%
531.deepsjeng_r	    0.72%	3.28%
541.leela_r	    0.18%	0.75%
548.exchange2_r	    8.29%	12.19%
557.xz_r	    0.40%	0.60%
GEOMEAN INT	    1.07%	5.71%

503.bwaves_r	    12.89%	21.59%
507.cactuBSSN_r	    0.90%	20.19%
508.namd_r	    0.77%	14.75%
510.parest_r	    0.91%	3.91%
511.povray_r	    0.45%	4.08%
519.lbm_r	    0.00%	0.00%
521.wrf_r	    5.97%	12.79%
526.blender_r	    0.49%	3.84%
527.cam4_r	    1.39%	3.28%
538.imagick_r	    1.86%	7.78%
544.nab_r	    0.41%	3.00%
549.fotonik3d_r	    25.50%	47.47%
554.roms_r	    5.17%	13.01%
GEOMEAN FP	    4.14%	11.38%
GEOMEAN ALL	    2.80%	8.88%


The only notable regression is from 548.exchange2_r: vectorizing the inner loop in each layer
of the 9-layer loop nest increases register pressure and causes more spills.
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
  - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
    .....
	- block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
    ...
- block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10

It looks like aarch64 doesn't have this issue because aarch64 has 32 GPRs, while x86 only has 16.
I have an extra patch that prevents loop vectorization in deeply nested loops for the x86 backend,
which brings the performance back.

For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost model increases code size
a lot but doesn't improve performance. N-Iter is much better there in terms of code size.


Any comments?


gcc/ChangeLog:

	* tree-vect-loop.cc (vect_analyze_loop_costing): Enable
	vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
	cost model.
	(vect_analyze_loop): Disable epilogue vectorization in very
	cheap cost model.
---
 gcc/tree-vect-loop.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 242d5e2d916..06afd8cae79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
      a copy of the scalar code (even if we might be able to vectorize it).  */
   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 			   /* No code motion support for multiple epilogues so for now
 			      not supported when multiple exits.  */
 			 && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
-			 && !loop->simduid);
+			 && !loop->simduid
+			 && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
   if (!vect_epilogues)
     return first_loop_vinfo;