From patchwork Wed May 15 03:04:29 2024
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 1935256
From: Haochen Jiang
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com, ubizjak@gmail.com
Subject: [PATCH 2/2] Align tight&hot loop without considering max skipping bytes.
Date: Wed, 15 May 2024 11:04:29 +0800
Message-Id: <20240515030429.2575440-3-haochen.jiang@intel.com>
In-Reply-To: <20240515030429.2575440-1-haochen.jiang@intel.com>
References: <20240515030429.2575440-1-haochen.jiang@intel.com>

From: liuhongt

When a hot loop is small enough to fit into one cacheline, we should
align the loop with ceil_log2 (loop_size) without considering the
maximum skip bytes.  It helps code prefetch.

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change
	gen_pad to gen_max_skip_align.
	(ix86_align_loops): New function.
	(ix86_reorg): Call ix86_align_loops.
	* config/i386/i386.md (pad): Rename to ..
	(max_skip_align): .. this, and accept 2 operands for align and skip.
---
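A minimal standalone sketch (not part of the patch) of the size-to-alignment
mapping described above.  The 64-byte cache line, the sample loop sizes and
the local ceil_log2 helper are illustrative assumptions; the ".p2align"
output mirrors what ASM_OUTPUT_MAX_SKIP_ALIGN typically expands to on gas
targets when the max skip is 0, i.e. an unconditional alignment.

/* align_sketch.cc: model how a tight hot loop of SIZE bytes would be
   aligned by ix86_align_loops (illustrative only).  */
#include <cstdio>

/* Stand-in for GCC's ceil_log2: smallest N such that (1 << N) >= X.  */
static int
ceil_log2 (unsigned int x)
{
  int n = 0;
  while ((1u << n) < x)
    n++;
  return n;
}

int
main ()
{
  /* Assumed cache line size (ix86_cost->prefetch_block on most modern x86).  */
  const unsigned int cache_line = 64;
  /* Hypothetical loop body sizes in bytes.  */
  const unsigned int loop_sizes[] = { 9, 24, 40, 70 };

  for (unsigned int size : loop_sizes)
    {
      if (size > cache_line)
        {
          /* Loop does not fit into one cacheline: leave it to the normal
             -falign-loops heuristics.  */
          printf ("size %3u: not force-aligned\n", size);
          continue;
        }
      int align = ceil_log2 (size);
      /* gen_max_skip_align (GEN_INT (align), GEN_INT (0)) eventually reaches
         ASM_OUTPUT_MAX_SKIP_ALIGN; a zero max skip means the loop head is
         always aligned, so the loop never straddles a cacheline.  */
      printf ("size %3u: .p2align %d   (align to %u bytes, no max skip)\n",
              size, align, 1u << align);
    }
  return 0;
}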
 gcc/config/i386/i386.cc | 148 +++++++++++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md |  10 +--
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e67e5f62533..c617091c8e1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23137,7 +23137,7 @@ ix86_avoid_jump_mispredicts (void)
 	  if (dump_file)
 	    fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 		     INSN_UID (insn), padsize);
-	  emit_insn_before (gen_pad (GEN_INT (padsize)), insn);
+	  emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn);
 	}
     }
 }
@@ -23410,6 +23410,150 @@ ix86_split_stlf_stall_load ()
     }
 }
 
+/* When a hot loop can fit into one cacheline,
+   force align the loop without considering the max skip.  */
+static void
+ix86_align_loops ()
+{
+  basic_block bb;
+
+  /* Don't do this when we don't know the cache line size.  */
+  if (ix86_cost->prefetch_block == 0)
+    return;
+
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  profile_count count_threshold = cfun->cfg->count_max / param_align_threshold;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      rtx_insn *label = BB_HEAD (bb);
+      bool has_fallthru = 0;
+      edge e;
+      edge_iterator ei;
+
+      if (!LABEL_P (label))
+	continue;
+
+      profile_count fallthru_count = profile_count::zero ();
+      profile_count branch_count = profile_count::zero ();
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_FALLTHRU)
+	    has_fallthru = 1, fallthru_count += e->count ();
+	  else
+	    branch_count += e->count ();
+	}
+
+      if (!fallthru_count.initialized_p () || !branch_count.initialized_p ())
+	continue;
+
+      if (bb->loop_father
+	  && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)
+	  && (has_fallthru
+	      ? (!(single_succ_p (bb)
+		   && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		 && optimize_bb_for_speed_p (bb)
+		 && branch_count + fallthru_count > count_threshold
+		 && (branch_count > fallthru_count * param_align_loop_iterations))
+	      /* In case there's no fallthru for the loop,
+		 the nops inserted won't be executed.  */
+	      : (branch_count > count_threshold
+		 || (bb->count > bb->prev_bb->count * 10
+		     && (bb->prev_bb->count
+			 <= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2)))))
+	{
+	  rtx_insn* insn, *end_insn;
+	  HOST_WIDE_INT size = 0;
+	  bool padding_p = true;
+	  basic_block tbb = bb;
+	  unsigned cond_branch_num = 0;
+	  bool detect_tight_loop_p = false;
+
+	  for (unsigned int i = 0; i != bb->loop_father->num_nodes;
+	       i++, tbb = tbb->next_bb)
+	    {
+	      /* Only handle continuous cfg layout.  */
+	      if (bb->loop_father != tbb->loop_father)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_BB_INSNS (tbb, insn)
+		{
+		  if (!NONDEBUG_INSN_P (insn))
+		    continue;
+		  size += ix86_min_insn_size (insn);
+
+		  /* We don't know the size of inline asm.
+		     Don't align a loop containing a call.  */
+		  if (asm_noperands (PATTERN (insn)) >= 0
+		      || CALL_P (insn))
+		    {
+		      size = -1;
+		      break;
+		    }
+		}
+
+	      if (size == -1 || size > ix86_cost->prefetch_block)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_EACH_EDGE (e, ei, tbb->succs)
+		{
+		  /* It could be part of the loop.  */
+		  if (e->dest == bb)
+		    {
+		      detect_tight_loop_p = true;
+		      break;
+		    }
+		}
+
+	      if (detect_tight_loop_p)
+		break;
+
+	      end_insn = BB_END (tbb);
+	      if (JUMP_P (end_insn))
+		{
+		  /* For decoded icache:
+		     1. Up to two branches are allowed per Way.
+		     2. A non-conditional branch is the last micro-op
+			in a Way.  */
+		  if (onlyjump_p (end_insn)
+		      && (any_uncondjump_p (end_insn)
+			  || single_succ_p (tbb)))
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		  else if (++cond_branch_num >= 2)
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		}
+
+	    }
+
+	  if (padding_p && detect_tight_loop_p)
+	    {
+	      emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 (size)),
+						    GEN_INT (0)), label);
+	      /* End of function.  */
+	      if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		break;
+	      /* Skip bb which already fits into one cacheline.  */
+	      bb = tbb;
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_dominance_info (CDI_DOMINATORS);
+}
+
 /* Implement machine specific optimizations.  We implement padding of
    returns for K8 CPUs and pass to avoid 4 jumps in the single 16 byte
    window.  */
 static void
@@ -23433,6 +23577,8 @@ ix86_reorg (void)
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
       if (TARGET_FOUR_JUMP_LIMIT)
 	ix86_avoid_jump_mispredicts ();
+
+      ix86_align_loops ();
 #endif
     }
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 764bfe20ff2..686de0bf2ff 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19150,16 +19150,18 @@
    (set_attr "length_immediate" "0")
    (set_attr "modrm" "0")])
 
-;; Pad to 16-byte boundary, max skip in op0.  Used to avoid
+;; Pad to 1 << op0 byte boundary, max skip in op1.  Used to avoid
 ;; branch prediction penalty for the third jump in a 16-byte
 ;; block on K8.
+;; Also it's used to align tight loops which can fit into 1 cacheline.
+;; It can help code prefetch and reduce DSB misses.
 
-(define_insn "pad"
-  [(unspec_volatile [(match_operand 0)] UNSPECV_ALIGN)]
+(define_insn "max_skip_align"
+  [(unspec_volatile [(match_operand 0) (match_operand 1)] UNSPECV_ALIGN)]
   ""
 {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, 4, (int)INTVAL (operands[0]));
+  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, (int)INTVAL (operands[0]), (int)INTVAL (operands[1]));
 #else
   /* It is tempting to use ASM_OUTPUT_ALIGN here, but we don't want to do
      that.  The align insn is used to avoid 3 jump instructions in the row
     to improve