From patchwork Wed May 15 03:04:29 2024
X-Patchwork-Submitter: "Jiang, Haochen"
X-Patchwork-Id: 1935256
From: Haochen Jiang
To: gcc-patches@gcc.gnu.org
Cc: hongtao.liu@intel.com, ubizjak@gmail.com
Subject: [PATCH 2/2] Align tight&hot loop without considering max skipping bytes.
Date: Wed, 15 May 2024 11:04:29 +0800
Message-Id: <20240515030429.2575440-3-haochen.jiang@intel.com>
In-Reply-To: <20240515030429.2575440-1-haochen.jiang@intel.com>
References: <20240515030429.2575440-1-haochen.jiang@intel.com>

From: liuhongt

When a hot loop is small enough to fit into one cacheline, we should
align the loop with ceil_log2 (loop_size) without considering the
maximum skip bytes.  It helps code prefetch.

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_avoid_jump_mispredicts): Change
	gen_pad to gen_max_skip_align.
	(ix86_align_loops): New function.
	(ix86_reorg): Call ix86_align_loops.
	* config/i386/i386.md (pad): Rename to ..
	(max_skip_align): .. this, and accept 2 operands for align and skip.
---
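A minimal standalone sketch (not part of the patch) of the size-to-alignment
mapping described above.  The 64-byte cache line, the sample loop sizes and
the local ceil_log2 helper are illustrative assumptions; the ".p2align"
output mirrors what ASM_OUTPUT_MAX_SKIP_ALIGN typically expands to on gas
targets when the max skip is 0, i.e. an unconditional alignment.

/* align_sketch.cc: model how a tight hot loop of SIZE bytes would be
   aligned by ix86_align_loops (illustrative only).  */
#include <cstdio>

/* Stand-in for GCC's ceil_log2: smallest N such that (1 << N) >= X.  */
static int
ceil_log2 (unsigned int x)
{
  int n = 0;
  while ((1u << n) < x)
    n++;
  return n;
}

int
main ()
{
  /* Assumed cache line size (ix86_cost->prefetch_block on most modern x86).  */
  const unsigned int cache_line = 64;
  /* Hypothetical loop body sizes in bytes.  */
  const unsigned int loop_sizes[] = { 9, 24, 40, 70 };

  for (unsigned int size : loop_sizes)
    {
      if (size > cache_line)
        {
          /* Loop does not fit into one cacheline: leave it to the normal
             -falign-loops heuristics.  */
          printf ("size %3u: not force-aligned\n", size);
          continue;
        }
      int align = ceil_log2 (size);
      /* gen_max_skip_align (GEN_INT (align), GEN_INT (0)) eventually reaches
         ASM_OUTPUT_MAX_SKIP_ALIGN; a zero max skip means the loop head is
         always aligned, so the loop never straddles a cacheline.  */
      printf ("size %3u: .p2align %d   (align to %u bytes, no max skip)\n",
              size, align, 1u << align);
    }
  return 0;
}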
 gcc/config/i386/i386.cc | 148 +++++++++++++++++++++++++++++++++++++++-
 gcc/config/i386/i386.md |  10 +--
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e67e5f62533..c617091c8e1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23137,7 +23137,7 @@ ix86_avoid_jump_mispredicts (void)
 	  if (dump_file)
 	    fprintf (dump_file, "Padding insn %i by %i bytes!\n",
 		     INSN_UID (insn), padsize);
-	  emit_insn_before (gen_pad (GEN_INT (padsize)), insn);
+	  emit_insn_before (gen_max_skip_align (GEN_INT (4), GEN_INT (padsize)), insn);
 	}
     }
 }
@@ -23410,6 +23410,150 @@ ix86_split_stlf_stall_load ()
     }
 }
 
+/* When a hot loop can fit into one cacheline,
+   force align the loop without considering the max skip.  */
+static void
+ix86_align_loops ()
+{
+  basic_block bb;
+
+  /* Don't do this when we don't know the cache line size.  */
+  if (ix86_cost->prefetch_block == 0)
+    return;
+
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  profile_count count_threshold = cfun->cfg->count_max / param_align_threshold;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      rtx_insn *label = BB_HEAD (bb);
+      bool has_fallthru = 0;
+      edge e;
+      edge_iterator ei;
+
+      if (!LABEL_P (label))
+	continue;
+
+      profile_count fallthru_count = profile_count::zero ();
+      profile_count branch_count = profile_count::zero ();
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_FALLTHRU)
+	    has_fallthru = 1, fallthru_count += e->count ();
+	  else
+	    branch_count += e->count ();
+	}
+
+      if (!fallthru_count.initialized_p () || !branch_count.initialized_p ())
+	continue;
+
+      if (bb->loop_father
+	  && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)
+	  && (has_fallthru
+	      ? (!(single_succ_p (bb)
+		   && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		 && optimize_bb_for_speed_p (bb)
+		 && branch_count + fallthru_count > count_threshold
+		 && (branch_count > fallthru_count * param_align_loop_iterations))
+	      /* In case there's no fallthru for the loop,
+		 the nops inserted won't be executed.  */
+	      : (branch_count > count_threshold
+		 || (bb->count > bb->prev_bb->count * 10
+		     && (bb->prev_bb->count
+			 <= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2)))))
+	{
+	  rtx_insn* insn, *end_insn;
+	  HOST_WIDE_INT size = 0;
+	  bool padding_p = true;
+	  basic_block tbb = bb;
+	  unsigned cond_branch_num = 0;
+	  bool detect_tight_loop_p = false;
+
+	  for (unsigned int i = 0; i != bb->loop_father->num_nodes;
+	       i++, tbb = tbb->next_bb)
+	    {
+	      /* Only handle continuous cfg layout.  */
+	      if (bb->loop_father != tbb->loop_father)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_BB_INSNS (tbb, insn)
+		{
+		  if (!NONDEBUG_INSN_P (insn))
+		    continue;
+		  size += ix86_min_insn_size (insn);
+
+		  /* We don't know the size of inline asm.
+		     Don't align a loop containing a call.  */
+		  if (asm_noperands (PATTERN (insn)) >= 0
+		      || CALL_P (insn))
+		    {
+		      size = -1;
+		      break;
+		    }
+		}
+
+	      if (size == -1 || size > ix86_cost->prefetch_block)
+		{
+		  padding_p = false;
+		  break;
+		}
+
+	      FOR_EACH_EDGE (e, ei, tbb->succs)
+		{
+		  /* It could be part of the loop.  */
+		  if (e->dest == bb)
+		    {
+		      detect_tight_loop_p = true;
+		      break;
+		    }
+		}
+
+	      if (detect_tight_loop_p)
+		break;
+
+	      end_insn = BB_END (tbb);
+	      if (JUMP_P (end_insn))
+		{
+		  /* For decoded icache:
+		     1. Up to two branches are allowed per Way.
+		     2. A non-conditional branch is the last micro-op
+			in a Way.  */
+		  if (onlyjump_p (end_insn)
+		      && (any_uncondjump_p (end_insn)
+			  || single_succ_p (tbb)))
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		  else if (++cond_branch_num >= 2)
+		    {
+		      padding_p = false;
+		      break;
+		    }
+		}
+
+	    }
+
+	  if (padding_p && detect_tight_loop_p)
+	    {
+	      emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 (size)),
+						    GEN_INT (0)), label);
+	      /* End of function.  */
+	      if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		break;
+	      /* Skip bb which already fits into one cacheline.  */
+	      bb = tbb;
+	    }
+	}
+    }
+
+  loop_optimizer_finalize ();
+  free_dominance_info (CDI_DOMINATORS);
+}
+
 /* Implement machine specific optimizations.  We implement padding of
    returns for K8 CPUs and pass to avoid 4 jumps in the single 16 byte
    window.  */
 static void
@@ -23433,6 +23577,8 @@ ix86_reorg (void)
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
       if (TARGET_FOUR_JUMP_LIMIT)
 	ix86_avoid_jump_mispredicts ();
+
+      ix86_align_loops ();
 #endif
     }
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 764bfe20ff2..686de0bf2ff 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -19150,16 +19150,18 @@
    (set_attr "length_immediate" "0")
    (set_attr "modrm" "0")])
 
-;; Pad to 16-byte boundary, max skip in op0.  Used to avoid
+;; Pad to 1 << op0 byte boundary, max skip in op1.  Used to avoid
 ;; branch prediction penalty for the third jump in a 16-byte
 ;; block on K8.
+;; Also it's used to align tight loops which can fit into 1 cacheline.
+;; It can help code prefetch and reduce DSB misses.
 
-(define_insn "pad"
-  [(unspec_volatile [(match_operand 0)] UNSPECV_ALIGN)]
+(define_insn "max_skip_align"
+  [(unspec_volatile [(match_operand 0) (match_operand 1)] UNSPECV_ALIGN)]
   ""
 {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, 4, (int)INTVAL (operands[0]));
+  ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, (int)INTVAL (operands[0]), (int)INTVAL (operands[1]));
 #else
   /* It is tempting to use ASM_OUTPUT_ALIGN here, but we don't want to do
      that.  The align insn is used to avoid 3 jump instructions in the row
     to improve