From patchwork Tue Dec 6 11:14:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 1712683 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=pp9I8/wG; dkim-atps=neutral Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4NRHrK6tblz23ns for ; Tue, 6 Dec 2022 22:14:56 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 6E63D3850B1F for ; Tue, 6 Dec 2022 11:14:54 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6E63D3850B1F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1670325294; bh=DbXp1huc+qm9gxxWHA1t/Wzk6C/Tj3nGcpaUdLDFMP0=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=pp9I8/wGuG6CI2sabrfut2gln2wW8lRnFS5AotFNYA/zvuVdsHv0GrAWtORoGphLj 7u9gfQ6IJrGde49k2mvvv56+ba44Gadl9cIPsQeDUx45YQ/IOD3tad1rVYuZCQzp9f 4wQ7VFZc/aRYWKq7r5+Q7dKW46YYsVyloGFnqNO0= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16]) by sourceware.org (Postfix) with ESMTPS id D218F3850B0B for ; Tue, 6 Dec 2022 11:14:24 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org D218F3850B0B Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 24891280891; Tue, 6 Dec 2022 12:14:23 +0100 (CET) Date: Tue, 6 Dec 2022 12:14:23 +0100 To: gcc-patches@gcc.gnu.org, mjambor@suse.cz, Alexander Monakov , "Kumar, Venkataramanan" , Tejas Sanjay , rguenther@suse.de Subject: Zen4 tuning part 2 - tuning flags Message-ID: MIME-Version: 1.0 Content-Disposition: inline X-Spam-Status: No, score=-11.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Jan Hubicka via Gcc-patches From: Jan Hubicka Reply-To: Jan Hubicka Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org Sender: "Gcc-patches" Hi, this patch adds tunes needed for zen4 microarchitecture. I added two new knobs. TARGET_AVX512_SPLIT_REGS which is used to specify that internally 512 vectors are split to 256 vectors. This affects vectorization costs and reassociation width. It probably should also affect RTX costs however I doubt it is very useful since RTL optimizers are usually not judging between 256 and 512 vectors. I also added X86_TUNE_AVOID_256FMA_CHAINS. Since fma has improved in zen4 this flag may not be a win except for very specific benchmarks. I am still doing some more detailed testing here. Oherwise I disabled gathers on zen4 for 2 parts nad 4 parts. We can open code them and since the latencies has only increased since zen3 opencoding is better than actual instrucction. This shows at 4 tsvc benchmarks. I ended up setting AVX256_OPTIMAL. This is a compromise. There are some tsvc benchmarks that increase noticeably (up to 250%) however there are also few regressions. Most of these can be solved by incrasing vec_perm cost in the vectorizer. However this does not cure about 14% regression on x264 that is quite important. Here we produce vectorized loops for avx512 that probably would be faster if the loops in question had high enough iteration count. We hit this problem with avx256 too: since the loop iterates few times, only prologues/epilogues are used. Adding another round of prologue/epilogue code does not make it better. Finally I enabled avx stores for constnat sized memcpy and memset. I am not sure why this is an opt-in feature. I think for most hardware this is a win. Bootstrapped/regtested x86_64-linux. I am waiting for final SPEC benchmarking to finish and plan to commit the patch and do possible additional tuning incrementally. * config/i386/i386-expand.cc (ix86_expand_set_or_cpymem): Add TARGET_AVX512_SPLIT_REGS * config/i386/i386-options.cc (ix86_option_override_internal): Honor x86_TONE_AVOID_256FMA_CHAINS. * config/i386/i386.cc (ix86_vec_cost): Honor TARGET_AVX512_SPLIT_REGS. (ix86_reassociation_width): Likewise. * i386.h (TARGET_AVX512_SPLIT_REGS): New tune. * x86-tune.def (X86_TUNE_GATHER_2PARTS, X86_TUNE_GATHER_4PARTS): Disable for zen4. (X86_TUNE_AVOID_256FMA_CHAINS): New; set for znver4. (x86_TUNE_AVOID_512FMA_CHAINS): New tune; set for znver4. (x86_TUNE_AVX256_MOVE_BY_PIECES): Add Zens 1-3. (x86_TUNE_AVX256_STORE_BY_PIECES): Add Zens 1-3. (x86_TUNE_AVX512_MOVE_BY_PIECES): Add Zen 4. (x86_TUNE_AVX512_STORE_BY_PIECES): Add Zen 4. diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index b920cfbaf90..ae58619cd12 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -8660,6 +8660,8 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp, if (TARGET_AVX256_SPLIT_REGS && GET_MODE_BITSIZE (move_mode) > 128) move_mode = TImode; + if (TARGET_AVX512_SPLIT_REGS && GET_MODE_BITSIZE (move_mode) > 256) + move_mode = OImode; /* Find the corresponding vector mode with the same size as MOVE_MODE. MOVE_MODE is an integer mode at the moment (SI, DI, TI, etc.). */ diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc index 44dcccb0a73..c188835b231 100644 --- a/gcc/config/i386/i386-options.cc +++ b/gcc/config/i386/i386-options.cc @@ -2980,6 +2980,8 @@ ix86_option_override_internal (bool main_args_p, } if (ix86_tune_features [X86_TUNE_AVOID_256FMA_CHAINS]) + SET_OPTION_IF_UNSET (opts, opts_set, param_avoid_fma_max_bits, 512); + else if (ix86_tune_features [X86_TUNE_AVOID_256FMA_CHAINS]) SET_OPTION_IF_UNSET (opts, opts_set, param_avoid_fma_max_bits, 256); else if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS]) SET_OPTION_IF_UNSET (opts, opts_set, param_avoid_fma_max_bits, 128); diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 95babd93c9d..8fa94cc9837 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -20363,10 +20363,13 @@ ix86_vec_cost (machine_mode mode, int cost) if (GET_MODE_BITSIZE (mode) == 128 && TARGET_SSE_SPLIT_REGS) - return cost * 2; - if (GET_MODE_BITSIZE (mode) > 128 + return cost * GET_MODE_BITSIZE (mode) / 64; + else if (GET_MODE_BITSIZE (mode) > 128 && TARGET_AVX256_SPLIT_REGS) return cost * GET_MODE_BITSIZE (mode) / 128; + else if (GET_MODE_BITSIZE (mode) > 256 + && TARGET_AVX512_SPLIT_REGS) + return cost * GET_MODE_BITSIZE (mode) / 256; return cost; } @@ -23090,7 +23093,9 @@ ix86_reassociation_width (unsigned int op, machine_mode mode) return 1; /* Account for targets that splits wide vectors into multiple parts. */ - if (TARGET_AVX256_SPLIT_REGS && GET_MODE_BITSIZE (mode) > 128) + if (TARGET_AVX512_SPLIT_REGS && GET_MODE_BITSIZE (mode) > 256) + div = GET_MODE_BITSIZE (mode) / 256; + else if (TARGET_AVX256_SPLIT_REGS && GET_MODE_BITSIZE (mode) > 128) div = GET_MODE_BITSIZE (mode) / 128; else if (TARGET_SSE_SPLIT_REGS && GET_MODE_BITSIZE (mode) > 64) div = GET_MODE_BITSIZE (mode) / 64; diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index d865fcb9466..e6a603ed31a 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -419,6 +419,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL] #define TARGET_AVX256_SPLIT_REGS \ ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS] +#define TARGET_AVX512_SPLIT_REGS \ + ix86_tune_features[X86_TUNE_AVX512_SPLIT_REGS] #define TARGET_GENERAL_REGS_SSE_SPILL \ ix86_tune_features[X86_TUNE_GENERAL_REGS_SSE_SPILL] #define TARGET_AVOID_MEM_OPND_FOR_CMOVE \ diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def index cd66f335113..5ce2c914aa0 100644 --- a/gcc/config/i386/x86-tune.def +++ b/gcc/config/i386/x86-tune.def @@ -481,12 +481,12 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes", /* X86_TUNE_USE_GATHER_2PARTS: Use gather instructions for vectors with 2 elements. */ DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts", - ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ALDERLAKE | m_CORE_ATOM | m_GENERIC)) + ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE | m_CORE_ATOM | m_GENERIC)) /* X86_TUNE_USE_GATHER_4PARTS: Use gather instructions for vectors with 4 elements. */ DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts", - ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ALDERLAKE | m_CORE_ATOM | m_GENERIC)) + ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE | m_CORE_ATOM | m_GENERIC)) /* X86_TUNE_USE_GATHER: Use gather instructions for vectors with 8 or more elements. */ @@ -499,7 +499,9 @@ DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER) /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or smaller FMA chain. */ -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3) +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4) + +DEF_TUNE (X86_TUNE_AVOID_512FMA_CHAINS, "avoid_fma512_chains", m_ZNVER4) /* X86_TUNE_V2DF_REDUCTION_PREFER_PHADDPD: Prefer haddpd for v2df vector reduction. */ @@ -531,27 +533,30 @@ DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2 /* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX instructions in the auto-vectorizer. */ -DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512) +DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512 | m_ZNVER4) + +/* X86_TUNE_AVX256_SPLIT_REGS: if true, AVX512 ops are split into two AVX256 ops. */ +DEF_TUNE (X86_TUNE_AVX512_SPLIT_REGS, "avx512_split_regs", m_ZNVER4) /* X86_TUNE_AVX256_MOVE_BY_PIECES: Optimize move_by_pieces with 256-bit AVX instructions. */ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces", - m_ALDERLAKE | m_CORE_AVX2) + m_ALDERLAKE | m_CORE_AVX2 | m_ZNVER1 | m_ZNVER2 | m_ZNVER3) /* X86_TUNE_AVX256_STORE_BY_PIECES: Optimize store_by_pieces with 256-bit AVX instructions. */ DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces", - m_ALDERLAKE | m_CORE_AVX2) + m_ALDERLAKE | m_CORE_AVX2 | m_ZNVER1 | m_ZNVER2 | m_ZNVER3) /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit AVX instructions. */ DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces", - m_SAPPHIRERAPIDS) + m_SAPPHIRERAPIDS | m_ZNVER4) /* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit AVX instructions. */ DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces", - m_SAPPHIRERAPIDS) + m_SAPPHIRERAPIDS | m_ZNVER4) /*****************************************************************************/ /*****************************************************************************/