From patchwork Wed May 15 03:04:28 2024
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 1935254
From: Haochen Jiang
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com, ubizjak@gmail.com
Subject: [PATCH 1/2] Adjust generic loop alignment from 16:11:8 to 16 for
 Intel processors
Date: Wed, 15 May 2024 11:04:28 +0800
Message-Id: <20240515030429.2575440-2-haochen.jiang@intel.com>
In-Reply-To: <20240515030429.2575440-1-haochen.jiang@intel.com>
References: <20240515030429.2575440-1-haochen.jiang@intel.com>

Previously, the generic tuning for Intel processors used 16:11:8 as the
loop alignment.  Because that spec skips the 16-byte alignment whenever
it would cost more than 11 padding bytes, small loops can end up
straddling a cache line, which showed up as random commit-to-commit
performance swings in benchmarks with small loops.  Changing the spec
to an unconditional 16-byte alignment should resolve the issue.

gcc/ChangeLog:

	* config/i386/x86-tune-costs.h (generic_cost): Change loop
	alignment from 16:11:8 to 16.
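For reference, a GCC alignment spec N:M:N2 means "align to an N-byte
boundary when that costs at most M padding bytes, otherwise fall back
to an N2-byte boundary" (see the -falign-loops documentation).  A
minimal standalone sketch of the placement rule follows; it is not GCC
code, and align_to/place_16_11_8 are hypothetical helpers written only
for illustration:

#include <cstdio>
#include <initializer_list>

/* Round ADDR up to the next multiple of the power-of-two A.  */
static unsigned align_to (unsigned addr, unsigned a)
{
  return (addr + a - 1) & ~(a - 1);
}

/* Placement under "16:11:8": take the 16-byte boundary only when it
   costs at most 11 padding bytes, otherwise settle for 8.  */
static unsigned place_16_11_8 (unsigned addr)
{
  unsigned pad = align_to (addr, 16) - addr;
  return pad <= 11 ? align_to (addr, 16) : align_to (addr, 8);
}

int main ()
{
  for (unsigned addr : {50u, 52u, 56u})
    printf ("code ends at %u: \"16:11:8\" places the loop at %u, \"16\" at %u\n",
	    addr, place_16_11_8 (addr), align_to (addr, 16));
  /* Ending at 50 or 52 needs 14 or 12 padding bytes, so "16:11:8"
     settles for the 8-byte boundary at 56; a loop longer than 8 bytes
     placed there straddles the cache-line boundary at 64, while "16"
     always starts it at 64.  */
}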
---
 gcc/config/i386/x86-tune-costs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 65d7d1f7e42..d3aaaa4b5cc 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -3758,7 +3758,7 @@ struct processor_costs generic_cost = {
   generic_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
   COSTS_N_INSNS (2),			/* cond_not_taken_branch_cost.  */
-  "16:11:8",				/* Loop alignment.  */
+  "16",					/* Loop alignment.  */
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
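For context on how these strings reach the assembler: the alignment
spec above, like the max_skip_align pattern added in patch 2/2 below,
is ultimately printed as a GAS .p2align directive through the target's
ASM_OUTPUT_MAX_SKIP_ALIGN macro.  A simplified sketch of that mapping
follows; the real macro lives in the target headers and may differ in
detail, and in particular treating a zero or overlarge max skip as "no
limit" is an assumption based on typical i386/ELF configurations:

#include <cstdio>

/* Print an alignment to 1 << LOG bytes, padding with at most
   MAX_SKIP bytes (0 = no limit).  */
static void output_max_skip_align (FILE *file, int log, int max_skip)
{
  if (log == 0)
    return;
  if (max_skip == 0 || max_skip >= (1 << log) - 1)
    fprintf (file, "\t.p2align %d\n", log);	/* unconditional */
  else
    fprintf (file, "\t.p2align %d,,%d\n", log, max_skip);
}

int main ()
{
  output_max_skip_align (stdout, 4, 11);  /* "16:11:..." -> .p2align 4,,11 */
  output_max_skip_align (stdout, 4, 0);   /* plain "16"  -> .p2align 4 */
}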
From patchwork Wed May 15 03:04:29 2024
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 1935256
From: Haochen Jiang
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com, ubizjak@gmail.com
Subject: [PATCH 2/2] Align tight&hot loop without considering max skipping
 bytes
Date: Wed, 15 May 2024 11:04:29 +0800
Message-Id: <20240515030429.2575440-3-haochen.jiang@intel.com>
In-Reply-To: <20240515030429.2575440-1-haochen.jiang@intel.com>
References: <20240515030429.2575440-1-haochen.jiang@intel.com>
From: liuhongt

When a hot loop is small enough to fit into one cache line, we should
align the loop to ceil_log2 (loop_size) without considering the maximum
skip bytes.  This helps code prefetch.

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change
	gen_pad to gen_max_skip_align.
	(ix86_align_loops): New function.
	(ix86_reorg): Call ix86_align_loops.
	* config/i386/i386.md (pad): Rename to ..
	(max_skip_align): .. this, and accept 2 operands for align and
	skip.
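As a worked example of that rule (plain C++ for illustration, not GCC
code; this ceil_log2 mirrors the behavior of the GCC helper of the same
name):

#include <cstdio>
#include <initializer_list>

/* Smallest N such that (1u << N) >= X.  */
static int ceil_log2 (unsigned x)
{
  int n = 0;
  while ((1u << n) < x)
    n++;
  return n;
}

int main ()
{
  /* With a 64-byte cache line, any loop no larger than the line is
     given an alignment at least as large as itself, so it lands
     entirely within a single line.  */
  for (unsigned size : {24u, 48u, 64u})
    printf ("loop of %u bytes -> align to %u bytes\n",
	    size, 1u << ceil_log2 (size));
  /* 24 -> 32, 48 -> 64, 64 -> 64.  */
}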
---
 gcc/config/i386/i386.cc | 148 +++++++++++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md | 10 +-
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e67e5f62533..c617091c8e1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23137,7 +23137,7 @@ ix86_avoid_jump_mispredicts (void)
 	  if (dump_file)
 	    fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 		     INSN_UID (insn), padsize);
-	  emit_insn_before (gen_pad (GEN_INT (padsize)), insn);
+	  emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn);
 	}
     }
 }
@@ -23410,6 +23410,150 @@ ix86_split_stlf_stall_load ()
     }
 }
 
+/* When a hot loop can fit into one cacheline,
+   force align the loop without considering the max skip.  */
+static void
+ix86_align_loops ()
+{
+  basic_block bb;
+
+  /* Don't do this when we don't know cache line size.  */
+  if (ix86_cost->prefetch_block == 0)
+    return;
+
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  profile_count count_threshold = cfun->cfg->count_max / param_align_threshold;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      rtx_insn *label = BB_HEAD (bb);
+      bool has_fallthru = 0;
+      edge e;
+      edge_iterator ei;
+
+      if (!LABEL_P (label))
+	continue;
+
+      profile_count fallthru_count = profile_count::zero ();
+      profile_count branch_count = profile_count::zero ();
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_FALLTHRU)
+	    has_fallthru = 1, fallthru_count += e->count ();
+	  else
+	    branch_count += e->count ();
+	}
+
+      if (!fallthru_count.initialized_p () || !branch_count.initialized_p ())
+	continue;
+
+      if (bb->loop_father
+	  && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)
+	  && (has_fallthru
+	      ? (!(single_succ_p (bb)
+		   && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		 && optimize_bb_for_speed_p (bb)
+		 && branch_count + fallthru_count > count_threshold
+		 && (branch_count
+		     > fallthru_count * param_align_loop_iterations))
+	      /* In case there's no fallthru into the loop, the
+		 inserted nops won't be executed.  */
+	      : (branch_count > count_threshold
+		 || (bb->count > bb->prev_bb->count * 10
+		     && (bb->prev_bb->count
+			 <= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2)))))
+	{
+	  rtx_insn *insn, *end_insn;
+	  HOST_WIDE_INT size = 0;
+	  bool padding_p = true;
+	  basic_block tbb = bb;
+	  unsigned cond_branch_num = 0;
+	  bool detect_tight_loop_p = false;
+
+	  for (unsigned int i = 0; i != bb->loop_father->num_nodes;
+	       i++, tbb = tbb->next_bb)
+	    {
+	      /* Only handle continuous cfg layout.  */
+	      if (bb->loop_father != tbb->loop_father)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_BB_INSNS (tbb, insn)
+		{
+		  if (!NONDEBUG_INSN_P (insn))
+		    continue;
+		  size += ix86_min_insn_size (insn);
+
+		  /* We don't know the size of inline asm.
+		     Don't align a loop containing a call.  */
+		  if (asm_noperands (PATTERN (insn)) >= 0
+		      || CALL_P (insn))
+		    {
+		      size = -1;
+		      break;
+		    }
+		}
+
+	      if (size == -1 || size > ix86_cost->prefetch_block)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_EACH_EDGE (e, ei, tbb->succs)
+		{
+		  /* It could be part of the loop.  */
+		  if (e->dest == bb)
+		    {
+		      detect_tight_loop_p = true;
+		      break;
+		    }
+		}
+
+	      if (detect_tight_loop_p)
+		break;
+
+	      end_insn = BB_END (tbb);
+	      if (JUMP_P (end_insn))
+		{
+		  /* For the decoded icache:
+		     1. Up to two branches are allowed per Way.
+		     2. A non-conditional branch is the last micro-op
+			in a Way.  */
+		  if (onlyjump_p (end_insn)
+		      && (any_uncondjump_p (end_insn)
+			  || single_succ_p (tbb)))
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		  else if (++cond_branch_num >= 2)
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		}
+	    }
+
+	  if (padding_p && detect_tight_loop_p)
+	    {
+	      emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 (size)),
+						    GEN_INT (0)), label);
+	      /* End of function.  */
+	      if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		break;
+	      /* Skip bb which already fits into one cacheline.  */
+	      bb = tbb;
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_dominance_info (CDI_DOMINATORS);
+}
+
 /* Implement machine specific optimizations.  We implement padding of
    returns for K8 CPUs and pass to avoid 4 jumps in the single 16 byte
    window.  */
 static void
@@ -23433,6 +23577,8 @@ ix86_reorg (void)
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
       if (TARGET_FOUR_JUMP_LIMIT)
 	ix86_avoid_jump_mispredicts ();
+
+      ix86_align_loops ();
 #endif
     }
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 764bfe20ff2..686de0bf2ff 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19150,16 +19150,18 @@
    (set_attr "length_immediate" "0")
    (set_attr "modrm" "0")])
 
-;; Pad to 16-byte boundary, max skip in op0.  Used to avoid
+;; Pad to 1 << op0 byte boundary, max skip in op1.  Used to avoid
 ;; branch prediction penalty for the third jump in a 16-byte
 ;; block on K8.
+;; It is also used to align tight loops which can fit into one
+;; cacheline, which helps code prefetch and reduces DSB misses.
 
-(define_insn "pad"
-  [(unspec_volatile [(match_operand 0)] UNSPECV_ALIGN)]
+(define_insn "max_skip_align"
+  [(unspec_volatile [(match_operand 0) (match_operand 1)] UNSPECV_ALIGN)]
   ""
 {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, 4, (int)INTVAL (operands[0]));
+  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, (int)INTVAL (operands[0]), (int)INTVAL (operands[1]));
 #else
   /* It is tempting to use ASM_OUTPUT_ALIGN here, but we don't want to do
      that.  The align insn is used to avoid 3 jump instructions in the row
     to improve