From patchwork Thu Sep 5 12:21:59 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jan Hubicka
X-Patchwork-Id: 1981224
Date: Thu, 5 Sep 2024 14:21:59 +0200
From: Jan Hubicka
To: gcc-patches@gcc.gnu.org
Subject: Zen5 tuning part 5: update instruction latencies in x86-tune-costs

Hi,
there is nothing exciting in this patch. I measured latencies and also
compared them with the newly released optimization guide, and it seems
that the only important change is that addss is faster now. Its latency
can be 2 cycles instead of 3 in some cases when the input operand is
computed by another addition. The throughput has also increased, but we
have no model for that. I added comments which should make it easier to
update the table for future revisions.

I also increased the large insn bound, since the decoders no longer seem
to require instructions to be 8 bytes or less.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

	* config/i386/x86-tune-costs.h (znver5_cost): Update instruction costs.

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index b90567fbbf2..1b3227ace16 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -2034,6 +2034,7 @@ struct processor_costs znver5_cost = {
   COSTS_N_INSNS (1),                   /* cost of a lea instruction. */
   COSTS_N_INSNS (1),                   /* variable shift costs. */
   COSTS_N_INSNS (1),                   /* constant shift costs. */
+  /* mul has latency 3, executes in 3 integer units. */
   {COSTS_N_INSNS (3),                  /* cost of starting multiply for QI. */
    COSTS_N_INSNS (3),                  /* HI. */
    COSTS_N_INSNS (3),                  /* SI. */
@@ -2041,6 +2042,8 @@ struct processor_costs znver5_cost = {
    COSTS_N_INSNS (3)},                 /* other. */
   0,                                   /* cost of multiply per each bit
                                           set. */
+  /* integer divide has latency of 8 cycles
+     plus 1 for every 9 bits of quotient. */
   {COSTS_N_INSNS (10),                 /* cost of a divide/mod for QI. */
    COSTS_N_INSNS (11),                 /* HI. */
    COSTS_N_INSNS (13),                 /* SI. */
@@ -2048,7 +2051,7 @@ struct processor_costs znver5_cost = {
    COSTS_N_INSNS (16)},                /* other. */
   COSTS_N_INSNS (1),                   /* cost of movsx. */
   COSTS_N_INSNS (1),                   /* cost of movzx. */
-  8,                                   /* "large" insn. */
+  15,                                  /* "large" insn. */
   9,                                   /* MOVE_RATIO. */
   6,                                   /* CLEAR_RATIO */
   {6, 6, 6},                           /* cost of loading integer registers
@@ -2065,12 +2068,13 @@ struct processor_costs znver5_cost = {
   2, 2, 2,                             /* cost of moving XMM,YMM,ZMM
                                           register. */
   6,                                   /* cost of moving SSE register to integer. */
-  /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
-     throughput 5. Approx 7 uops do not depend on vector size and every load
-     is 5 uops. */
+
+  /* TODO: gather and scatter instructions are currently disabled in
+     x86-tune.def. In some cases they are however a win, see PR116582.
+     We however need a good cost model for them. */
   14, 10,                              /* Gather load static, per_elt. */
   14, 20,                              /* Gather store static, per_elt. */
-  32,                                  /* size of l1 cache. */
+  48,                                  /* size of l1 cache. */
   1024,                                /* size of l2 cache. */
   64,                                  /* size of prefetch block. */
   /* New AMD processors never drop prefetches; if they cannot be performed
@@ -2080,6 +2084,8 @@ struct processor_costs znver5_cost = {
      time). */
   100,                                 /* number of parallel prefetches. */
   3,                                   /* Branch cost. */
+  /* TODO: x87 latencies are still based on znver4.
+     Probably not very important these days. */
   COSTS_N_INSNS (7),                   /* cost of FADD and FSUB insns. */
   COSTS_N_INSNS (7),                   /* cost of FMUL instruction. */
   /* Latency of fdiv is 8-15. */
@@ -2089,16 +2095,24 @@ struct processor_costs znver5_cost = {
   /* Latency of fsqrt is 4-10. */
   COSTS_N_INSNS (25),                  /* cost of FSQRT instruction. */
 
+  /* SSE instructions have typical throughput 4 and latency 1. */
   COSTS_N_INSNS (1),                   /* cost of cheap SSE instruction. */
-  COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns. */
+  /* ADDSS has throughput 2 and latency 2
+     (in some cases when source is another addition). */
+  COSTS_N_INSNS (2),                   /* cost of ADDSS/SD SUBSS/SD insns. */
+  /* MULSS has throughput 2 and latency 3. */
   COSTS_N_INSNS (3),                   /* cost of MULSS instruction. */
   COSTS_N_INSNS (3),                   /* cost of MULSD instruction. */
+  /* FMA has throughput 2 and latency 4. */
   COSTS_N_INSNS (4),                   /* cost of FMA SS instruction. */
   COSTS_N_INSNS (4),                   /* cost of FMA SD instruction. */
+  /* DIVSS has throughput 0.4 and latency 10. */
   COSTS_N_INSNS (10),                  /* cost of DIVSS instruction. */
-  /* 9-13. */
+  /* DIVSD has throughput 0.25 and latency 13. */
   COSTS_N_INSNS (13),                  /* cost of DIVSD instruction. */
+  /* SQRTSS has throughput 0.22 and latency 14. */
   COSTS_N_INSNS (14),                  /* cost of SQRTSS instruction. */
+  /* SQRTSD has throughput 0.13 and latency 20. */
   COSTS_N_INSNS (20),                  /* cost of SQRTSD instruction. */
   /* Zen5 can execute:
      - integer ops: 6 per cycle, at most 3 multiplications.