From patchwork Fri Feb 23 14:15:49 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 1903317
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Subject: [pushed] aarch64: Spread out FPR usage between RA regions [PR113613]
Date: Fri, 23 Feb 2024 14:15:49 +0000
List-Id: Gcc-patches mailing list

early-ra already had code to do regrename-style "broadening"
of the allocation, to promote scheduling freedom. However,
the pass divides the function into allocation regions
and this broadening only worked within a single region.

This meant that if a basic block contained one subblock
of FPR use, followed by a point at which no FPRs were live,
followed by another subblock of FPR use, the two subblocks
would tend to reuse the same registers. This in turn meant
that it wasn't possible to form LDP/STP pairs between them.

The failure to form LDPs and STPs in the testcase was a
regression from GCC 13.

The patch adds a simple heuristic to prefer less recently
used registers in the event of a tie.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
	PR target/113613
	* config/aarch64/aarch64-early-ra.cc
	(early_ra::m_current_region): New member variable.
	(early_ra::m_fpr_recency): Likewise.
	(early_ra::start_new_region): Bump m_current_region.
	(early_ra::allocate_colors): Prefer less recently used registers
	in the event of a tie. Add a comment to explain why we prefer(ed)
	higher-numbered registers.
	(early_ra::find_oldest_color): Prefer less recently used registers
	here too.
	(early_ra::finalize_allocation): Update recency information for
	allocated registers.
	(early_ra::process_blocks): Initialize m_current_region and
	m_fpr_recency.

gcc/testsuite/
	PR target/113613
	* gcc.target/aarch64/pr113613.c: New test.
---
 gcc/config/aarch64/aarch64-early-ra.cc      | 55 +++++++++++++++++----
 gcc/testsuite/gcc.target/aarch64/pr113613.c | 13 +++++
 2 files changed, 59 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr113613.c

diff --git a/gcc/config/aarch64/aarch64-early-ra.cc b/gcc/config/aarch64/aarch64-early-ra.cc
index 9ac9ec1bb0d..8530b0ae41e 100644
--- a/gcc/config/aarch64/aarch64-early-ra.cc
+++ b/gcc/config/aarch64/aarch64-early-ra.cc
@@ -532,6 +532,12 @@ private:
   // The set of FPRs that are currently live.
   unsigned int m_live_fprs;
 
+  // A unique one-based identifier for the current region.
+  unsigned int m_current_region;
+
+  // The region in which each FPR was last used, or 0 if none.
+  unsigned int m_fpr_recency[32];
+
   // ----------------------------------------------------------------------
 
   // A mask of the FPRs that have already been allocated.
@@ -1305,6 +1311,7 @@ early_ra::start_new_region ()
   m_allocated_fprs = 0;
   m_call_preserved_fprs = 0;
   m_allocation_successful = true;
+  m_current_region += 1;
 }
 
 // Create and return an allocno group of size SIZE for register REGNO.
@@ -2819,19 +2826,30 @@ early_ra::allocate_colors ()
         candidates &= ~(m_allocated_fprs >> i);
       unsigned int best = INVALID_REGNUM;
       int best_weight = 0;
+      unsigned int best_recency = 0;
       for (unsigned int fpr = 0; fpr <= 32U - color->group->size; ++fpr)
         {
           if ((candidates & (1U << fpr)) == 0)
             continue;
           int weight = color->fpr_preferences[fpr];
+          unsigned int recency = 0;
           // Account for registers that the current function must preserve.
           for (unsigned int i = 0; i < color->group->size; ++i)
-            if (m_call_preserved_fprs & (1U << (fpr + i)))
-              weight -= 1;
-          if (best == INVALID_REGNUM || best_weight <= weight)
+            {
+              if (m_call_preserved_fprs & (1U << (fpr + i)))
+                weight -= 1;
+              recency = MAX (recency, m_fpr_recency[fpr + i]);
+            }
+          // Prefer higher-numbered registers in the event of a tie.
+          // This should tend to keep lower-numbered registers free
+          // for allocnos that require V0-V7 or V0-V15.
+          if (best == INVALID_REGNUM
+              || best_weight < weight
+              || (best_weight == weight && recency <= best_recency))
             {
               best = fpr;
               best_weight = weight;
+              best_recency = recency;
             }
         }
 
@@ -2888,19 +2906,27 @@ early_ra::find_oldest_color (unsigned int first_color,
 {
   color_info *best = nullptr;
   unsigned int best_start_point = ~0U;
+  unsigned int best_recency = 0;
   for (unsigned int ci = first_color; ci < m_colors.length (); ++ci)
     {
       auto *color = m_colors[ci];
-      if (fpr_conflicts & (1U << (color->hard_regno - V0_REGNUM)))
+      unsigned int fpr = color->hard_regno - V0_REGNUM;
+      if (fpr_conflicts & (1U << fpr))
         continue;
-      if (!color->group)
-        return color;
-      auto chain_head = color->group->chain_heads ()[0];
-      auto start_point = m_allocnos[chain_head]->start_point;
-      if (!best || best_start_point > start_point)
+      unsigned int start_point = 0;
+      if (color->group)
+        {
+          auto chain_head = color->group->chain_heads ()[0];
+          start_point = m_allocnos[chain_head]->start_point;
+        }
+      unsigned int recency = m_fpr_recency[fpr];
+      if (!best
+          || best_start_point > start_point
+          || (best_start_point == start_point && recency < best_recency))
         {
           best = color;
           best_start_point = start_point;
+          best_recency = recency;
         }
     }
   return best;
@@ -3004,6 +3030,13 @@ early_ra::broaden_colors ()
 void
 early_ra::finalize_allocation ()
 {
+  for (auto *color : m_colors)
+    if (color->group)
+      {
+        unsigned int fpr = color->hard_regno - V0_REGNUM;
+        for (unsigned int i = 0; i < color->group->size; ++i)
+          m_fpr_recency[fpr + i] = m_current_region;
+      }
   for (auto *allocno : m_allocnos)
     {
       if (allocno->is_shared ())
@@ -3521,6 +3554,10 @@ early_ra::process_blocks ()
       bitmap_set_bit (fpr_pseudos_live_in, bb->index);
     }
 
+  // This is incremented by 1 at the start of each region.
+  m_current_region = 0;
+  memset (m_fpr_recency, 0, sizeof (m_fpr_recency));
+
   struct stack_node { edge_iterator ei; basic_block bb; };
   auto_vec stack;
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr113613.c b/gcc/testsuite/gcc.target/aarch64/pr113613.c
new file mode 100644
index 00000000000..382e4a11c0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr113613.c
@@ -0,0 +1,13 @@
+// { dg-options "-O2" }
+
+typedef float __attribute__((vector_size(8))) v2sf;
+v2sf a[4];
+v2sf b[4];
+void f()
+{
+  b[0] += a[0];
+  b[1] += a[1];
+}
+
+// { dg-final { scan-assembler-times {\tldp\t} 2 } }
+// { dg-final { scan-assembler-times {\tstp\t} 1 } }
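
For readers who want to try the tie-break outside the pass, the selection rule from the allocate_colors hunk can be sketched as a standalone function. This is a simplified illustration under stated assumptions, not the code in aarch64-early-ra.cc: the names pick_fpr, recency_table and NO_REG are hypothetical, the weights array stands in for color->fpr_preferences, and the call-preserved adjustment is left out. Only the three-way comparison (weight first, then recency, then register number) is taken from the patch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for m_fpr_recency: recency[r] is the one-based
// region in which FPR r was last used, or 0 if it has not been used.
using recency_table = std::array<unsigned int, 32>;

// Hypothetical "no register found" marker (the pass uses INVALID_REGNUM).
constexpr unsigned int NO_REG = ~0U;

// Choose the first register of a group of SIZE consecutive FPRs.
// Bit R of CANDIDATES is set if a group may start at register R.
// WEIGHTS[R] is the preference for starting at R; higher is better.
unsigned int
pick_fpr (std::uint32_t candidates, const std::array<int, 32> &weights,
          const recency_table &recency, unsigned int size)
{
  assert (size >= 1 && size <= 32);
  unsigned int best = NO_REG;
  int best_weight = 0;
  unsigned int best_recency = 0;
  for (unsigned int fpr = 0; fpr + size <= 32; ++fpr)
    {
      if ((candidates & (1U << fpr)) == 0)
        continue;
      int weight = weights[fpr];
      // A group counts as being as recent as its most recently used member.
      unsigned int group_recency = 0;
      for (unsigned int i = 0; i < size; ++i)
        group_recency = std::max (group_recency, recency[fpr + i]);
      // Higher weight wins.  On equal weight, the less recently used
      // group wins.  On equal recency, "<=" lets the later (that is,
      // higher-numbered) register win.
      if (best == NO_REG
          || best_weight < weight
          || (best_weight == weight && group_recency <= best_recency))
        {
          best = fpr;
          best_weight = weight;
          best_recency = group_recency;
        }
    }
  return best;
}
```

With all weights and recencies zero and every register a candidate, pick_fpr returns 31 for a single-register group. Once recency[31] is set to 1 (as finalize_allocation would do for a register used in region 1), the same call returns 30, which is the spreading effect the patch relies on to leave distinct registers available for LDP/STP pairing in the next region.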