From patchwork Wed May 15 03:04:28 2024
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 1935254
From: Haochen Jiang
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com, ubizjak@gmail.com
Subject: [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for
 Intel processors
Date: Wed, 15 May 2024 11:04:28 +0800
Message-Id: <20240515030429.2575440-2-haochen.jiang@intel.com>
In-Reply-To: <20240515030429.2575440-1-haochen.jiang@intel.com>
References: <20240515030429.2575440-1-haochen.jiang@intel.com>

Previously, the generic tuning for Intel processors used 16:11:8 as the
loop alignment.  Because that spec skips the 16-byte alignment whenever
it would cost more than 11 padding bytes, small loops can end up
straddling a cache line, which showed up as random commit-to-commit
performance swings in benchmarks with small loops.  Changing the spec
to an unconditional 16-byte alignment should resolve the issue.

gcc/ChangeLog:

	* config/i386/x86-tune-costs.h (generic_cost): Change loop
	alignment from 16:11:8 to 16.
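For reference, a GCC alignment spec N:M:N2 means "align to an N-byte
boundary when that costs at most M padding bytes, otherwise fall back
to an N2-byte boundary" (see the -falign-loops documentation).  A
minimal standalone sketch of the placement rule follows; it is not GCC
code, and align_to/place_16_11_8 are hypothetical helpers written only
for illustration:

#include <cstdio>
#include <initializer_list>

/* Round ADDR up to the next multiple of the power-of-two A.  */
static unsigned align_to (unsigned addr, unsigned a)
{
  return (addr + a - 1) & ~(a - 1);
}

/* Placement under "16:11:8": take the 16-byte boundary only when it
   costs at most 11 padding bytes, otherwise settle for 8.  */
static unsigned place_16_11_8 (unsigned addr)
{
  unsigned pad = align_to (addr, 16) - addr;
  return pad <= 11 ? align_to (addr, 16) : align_to (addr, 8);
}

int main ()
{
  for (unsigned addr : {50u, 52u, 56u})
    printf ("code ends at %u: \"16:11:8\" places the loop at %u, \"16\" at %u\n",
	    addr, place_16_11_8 (addr), align_to (addr, 16));
  /* Ending at 50 or 52 needs 14 or 12 padding bytes, so "16:11:8"
     settles for the 8-byte boundary at 56; a loop longer than 8 bytes
     placed there straddles the cache-line boundary at 64, while "16"
     always starts it at 64.  */
}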
---
 gcc/config/i386/x86-tune-costs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 65d7d1f7e42..d3aaaa4b5cc 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -3758,7 +3758,7 @@ struct processor_costs generic_cost = {
   generic_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
   COSTS_N_INSNS (2),			/* cond_not_taken_branch_cost.  */
-  "16:11:8",				/* Loop alignment.  */
+  "16",					/* Loop alignment.  */
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
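For context on how these strings reach the assembler: the alignment
spec above, like the max_skip_align pattern added in patch 2/2 below,
is ultimately printed as a GAS .p2align directive through the target's
ASM_OUTPUT_MAX_SKIP_ALIGN macro.  A simplified sketch of that mapping
follows; the real macro lives in the target headers and may differ in
detail, and in particular treating a zero or overlarge max skip as "no
limit" is an assumption based on typical i386/ELF configurations:

#include <cstdio>

/* Print an alignment to 1 << LOG bytes, padding with at most
   MAX_SKIP bytes (0 = no limit).  */
static void output_max_skip_align (FILE *file, int log, int max_skip)
{
  if (log == 0)
    return;
  if (max_skip == 0 || max_skip >= (1 << log) - 1)
    fprintf (file, "\t.p2align %d\n", log);	/* unconditional */
  else
    fprintf (file, "\t.p2align %d,,%d\n", log, max_skip);
}

int main ()
{
  output_max_skip_align (stdout, 4, 11);  /* "16:11:..." -> .p2align 4,,11 */
  output_max_skip_align (stdout, 4, 0);   /* plain "16"  -> .p2align 4 */
}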
From patchwork Wed May 15 03:04:29 2024
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 1935256
From: Haochen Jiang
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com, ubizjak@gmail.com
Subject: [PATCH 2/2] Align tight&hot loop without considering max skipping
 bytes
Date: Wed, 15 May 2024 11:04:29 +0800
Message-Id: <20240515030429.2575440-3-haochen.jiang@intel.com>
In-Reply-To: <20240515030429.2575440-1-haochen.jiang@intel.com>
References: <20240515030429.2575440-1-haochen.jiang@intel.com>
From: liuhongt

When a hot loop is small enough to fit into one cache line, we should
align the loop to ceil_log2 (loop_size) without considering the maximum
skip bytes.  This helps code prefetch.

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change
	gen_pad to gen_max_skip_align.
	(ix86_align_loops): New function.
	(ix86_reorg): Call ix86_align_loops.
	* config/i386/i386.md (pad): Rename to ..
	(max_skip_align): .. this, and accept 2 operands for align and
	skip.
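As a worked example of that rule (plain C++ for illustration, not GCC
code; this ceil_log2 mirrors the behavior of the GCC helper of the same
name):

#include <cstdio>
#include <initializer_list>

/* Smallest N such that (1u << N) >= X.  */
static int ceil_log2 (unsigned x)
{
  int n = 0;
  while ((1u << n) < x)
    n++;
  return n;
}

int main ()
{
  /* With a 64-byte cache line, any loop no larger than the line is
     given an alignment at least as large as itself, so it lands
     entirely within a single line.  */
  for (unsigned size : {24u, 48u, 64u})
    printf ("loop of %u bytes -> align to %u bytes\n",
	    size, 1u << ceil_log2 (size));
  /* 24 -> 32, 48 -> 64, 64 -> 64.  */
}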
---
 gcc/config/i386/i386.cc | 148 +++++++++++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md | 10 +-
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e67e5f62533..c617091c8e1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23137,7 +23137,7 @@ ix86_avoid_jump_mispredicts (void)
 	  if (dump_file)
 	    fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 		     INSN_UID (insn), padsize);
-	  emit_insn_before (gen_pad (GEN_INT (padsize)), insn);
+	  emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn);
 	}
     }
 }
@@ -23410,6 +23410,150 @@ ix86_split_stlf_stall_load ()
     }
 }
 
+/* When a hot loop can fit into one cacheline,
+   force align the loop without considering the max skip.  */
+static void
+ix86_align_loops ()
+{
+  basic_block bb;
+
+  /* Don't do this when we don't know cache line size.  */
+  if (ix86_cost->prefetch_block == 0)
+    return;
+
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  profile_count count_threshold = cfun->cfg->count_max / param_align_threshold;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      rtx_insn *label = BB_HEAD (bb);
+      bool has_fallthru = 0;
+      edge e;
+      edge_iterator ei;
+
+      if (!LABEL_P (label))
+	continue;
+
+      profile_count fallthru_count = profile_count::zero ();
+      profile_count branch_count = profile_count::zero ();
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_FALLTHRU)
+	    has_fallthru = 1, fallthru_count += e->count ();
+	  else
+	    branch_count += e->count ();
+	}
+
+      if (!fallthru_count.initialized_p () || !branch_count.initialized_p ())
+	continue;
+
+      if (bb->loop_father
+	  && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)
+	  && (has_fallthru
+	      ? (!(single_succ_p (bb)
+		   && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		 && optimize_bb_for_speed_p (bb)
+		 && branch_count + fallthru_count > count_threshold
+		 && (branch_count
+		     > fallthru_count * param_align_loop_iterations))
+	      /* In case there's no fallthru into the loop, the
+		 inserted nops won't be executed.  */
+	      : (branch_count > count_threshold
+		 || (bb->count > bb->prev_bb->count * 10
+		     && (bb->prev_bb->count
+			 <= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2)))))
+	{
+	  rtx_insn *insn, *end_insn;
+	  HOST_WIDE_INT size = 0;
+	  bool padding_p = true;
+	  basic_block tbb = bb;
+	  unsigned cond_branch_num = 0;
+	  bool detect_tight_loop_p = false;
+
+	  for (unsigned int i = 0; i != bb->loop_father->num_nodes;
+	       i++, tbb = tbb->next_bb)
+	    {
+	      /* Only handle continuous cfg layout.  */
+	      if (bb->loop_father != tbb->loop_father)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_BB_INSNS (tbb, insn)
+		{
+		  if (!NONDEBUG_INSN_P (insn))
+		    continue;
+		  size += ix86_min_insn_size (insn);
+
+		  /* We don't know the size of inline asm.
+		     Don't align a loop containing a call.  */
+		  if (asm_noperands (PATTERN (insn)) >= 0
+		      || CALL_P (insn))
+		    {
+		      size = -1;
+		      break;
+		    }
+		}
+
+	      if (size == -1 || size > ix86_cost->prefetch_block)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_EACH_EDGE (e, ei, tbb->succs)
+		{
+		  /* It could be part of the loop.  */
+		  if (e->dest == bb)
+		    {
+		      detect_tight_loop_p = true;
+		      break;
+		    }
+		}
+
+	      if (detect_tight_loop_p)
+		break;
+
+	      end_insn = BB_END (tbb);
+	      if (JUMP_P (end_insn))
+		{
+		  /* For the decoded icache:
+		     1. Up to two branches are allowed per Way.
+		     2. A non-conditional branch is the last micro-op
+			in a Way.  */
+		  if (onlyjump_p (end_insn)
+		      && (any_uncondjump_p (end_insn)
+			  || single_succ_p (tbb)))
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		  else if (++cond_branch_num >= 2)
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		}
+	    }
+
+	  if (padding_p && detect_tight_loop_p)
+	    {
+	      emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 (size)),
+						    GEN_INT (0)), label);
+	      /* End of function.  */
+	      if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		break;
+	      /* Skip bb which already fits into one cacheline.  */
+	      bb = tbb;
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_dominance_info (CDI_DOMINATORS);
+}
+
 /* Implement machine specific optimizations.  We implement padding of
    returns for K8 CPUs and pass to avoid 4 jumps in the single 16 byte
    window.  */
 static void
@@ -23433,6 +23577,8 @@ ix86_reorg (void)
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
       if (TARGET_FOUR_JUMP_LIMIT)
 	ix86_avoid_jump_mispredicts ();
+
+      ix86_align_loops ();
 #endif
     }
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 764bfe20ff2..686de0bf2ff 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19150,16 +19150,18 @@
    (set_attr "length_immediate" "0")
    (set_attr "modrm" "0")])
 
-;; Pad to 16-byte boundary, max skip in op0.  Used to avoid
+;; Pad to 1 << op0 byte boundary, max skip in op1.  Used to avoid
 ;; branch prediction penalty for the third jump in a 16-byte
 ;; block on K8.
+;; It is also used to align tight loops which can fit into one
+;; cacheline, which helps code prefetch and reduces DSB misses.
 
-(define_insn "pad"
-  [(unspec_volatile [(match_operand 0)] UNSPECV_ALIGN)]
+(define_insn "max_skip_align"
+  [(unspec_volatile [(match_operand 0) (match_operand 1)] UNSPECV_ALIGN)]
   ""
 {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, 4, (int)INTVAL (operands[0]));
+  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, (int)INTVAL (operands[0]), (int)INTVAL (operands[1]));
 #else
   /* It is tempting to use ASM_OUTPUT_ALIGN here, but we don't want to do
      that.  The align insn is used to avoid 3 jump instructions in the row
     to improve